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ABSTRACT 


This paper addresses the problem of efficiently computing the motor torques required to 
drive a lower-pair kinematic chain (e.g., a typical manipulator arm in free motion, or a 
mechanical leg in the swing phase) given the desired trajectory; i.e., the Inverse Dynamics 
problem. It investigates the high degree of parallelism inherent in the computations, and 
presents two “mathematically exact" formulations especially suited to high-speed, highly 
parallel implementations using special-purpose hardware or VLSI devices. In principle, the 
formulations should permit the calculations to run at a speed bounded only by I/O. The 
first presented is a parallel version of the recent linear Newton-Euler recursive algorithm. 
The time cost is also linear in the number of joints, but the real-time coefficients are 
reduced by almost two orders of magnitude. The second formulation reports a new parallel 
algorithm which shows that it is possible to improve upon the linear time dependency. The 
real time required to perform the calculations increases only as the |"log 2 ] of the number 
of joints. Either formulation is susceptible to a systolic pipelined architecture in which 
complete sets of joint torques emerge at successive intervals of four floating-point operations. 
Hardware requirements necessary to support the algorithm are considered and found not to 
be excessive, and a VLSI implementation architecture is suggested. We indicate possible 
applications to incorporating dynamical considerations into trajectory planning, e.g. it may 
be possible to build an on-line trajectory optimizer. 
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0. OVERVIEW 


The Inverse Dynamics problem consists (loosely) of computing the motor torques necessary 
to drive a mechanical manipulator through a specified motion: given that you know where it 
should go, compute what you have to do to get it to go there. Recently, efficient recursive 
formulations have been developed using both Newton-Euler and Lagrangian dynamics 
[16][29], These have reduced the number of additions and multiplications (for n joints) 
from an 0(n /| ) dependency (see Table 0.1) to one linear in the number of joints, 0(n), 
requiring 


(150n — 48) Mults -f (131n — 48) Addns (Newton-Euler) 

multiplications (Mults) and additions (Addns) when performed serially. 

This paper investigates the high degree of parallelism inherent in the calculations, and 
presents two formulations especially suited to highly parallel implementations using special- 
purpose hardware or VLSI devices. Table 0.1 shows the improvement over serial implemen¬ 
tations. (Note that this reflects the algorithmically indicated cost, as in Hollerbach[16]; see 
the discussion at the beginning of Section III.) 

The first formulation is again linear in the number of joints, but reduces the real-time 
coefficients by almost two orders of magnitude to 

(2n -f- 3) Mults + (6n -(- 7) Addns (Newton-Euler) 

The second formulation shows that by exploiting a novel parallel algorithm developed 
below, the time required to perform the calculations increases only as O(log(n)). The time 
dependencies are 

(2flog 2 (n + 1)] + 5) Mults + (6[log 2 (n -f 1)] -j- 10) Addns (Newton-Euler) 

Either formulation is susceptible to a systolic pipelined architecture. We show below 
that the basic time cycle of the algorithm is 1 Mult + 3 Addns. Thus, after the first complete 
set of joint torques had emerged from the pipeline, successive sets would appear at intervals 
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of four floating point operations (4 Flops). This yields the ability to rapidly and efficiently 
evaluate a large number of alternatives. 

Section I contains a brief introduction to the problem and review of previous work in 
the area. 

Section II explains in detail the notation used in this paper. It is essentially an adaptation 
of the notation used by Hollerbach[16] and Luh et al.[29], which in turn derives from the 
Denavit-Hartenberg[10] convention for lower-pair linkages through Uicker[43] and Kahn[19]. 
The reader already broadly familiar with this notation should at least review Table 11.1 where 
the notation is summarized. 

Section III explicates the first approach considered, yielding an 0(n) formulation with 
greatly reduced coefficients. Luh et al.[29] give the recursive form of the Newton-Euler 
formulation as shown in Table 1.1. Many of the computations associated with any given 
joint (node) may proceed concurrently. For example, the computations of the variables 
w, (denoted by (*) in Table 1.1) and w, (**) do not interact (given that w,_! and w,_i 
have already been computed) and hence may be performed at the same time by different 
sub-processors. Additionally, different sub expressions of the same variable may often 
be computed concurrently by different sub-processors. Finally, by locally pipelining the 
recursive variables additional speed may be gained; e.g., the computation of w, +] may be 
started before the computation of w, has finished. (See Tables III. 1 and III.2). Essentially, 
this formulation arises from trying to compute as much as possible as early as possible, and 
still remains within a basic linear structure. 

Section IV shows the derivation of an O(log(n)) time dynamics formulation. This arises 
from restructuring the fundamental framework within which computation proceeds, together 
with a corresponding revision to the recursive equations. The linear recursive algorithm is, 
conceptually, a formalism for beginning at the manipulator base and propagating desired 
motion outward link by link to the tip, then propagating tip environmental forces and torques 
back inward link by link to the base (determining the needed joint motor torques along the 
way). (See Figure 111.1). In contrast, the logarithmic recursive algorithm is a mechanism for 
recursively propagating desired motion (or, forces and torques) between any two adjacent 
groups of links. Conceptually speaking, the propagation of desired motion from base to tip 
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is accomplished by grouping together adjacent links on the first step to form (n/2) groups 
of two links each, then on each succeeding step grouping adjacent pairs of groups together 
until after [log 2 (n)l steps there is one group encompassing all links (actually, the intermediate 
groups are also formed). It is analogous to summing n numbers in [log 2 (n)] steps by adding 
together first adjacent pairs, then adjacent pairs of pairs, and so forth. (See Figures IV.1 
and IV.2). 

A synthesis of the two approaches is presented in Section V. The techniques of Section 
III for exploiting parallelism within a node are applied to the 0(log(n)) time structure of 
Section IV, yielding an O(log(n)) formulation with reduced coefficients. 

Ultimately the algorithm must be expressed in hardware, and Section VI addresses 
a few words to potential implementations. The principle thrust of this paper lies in the 
analytic formulations above, and we will consider hardware only to the extent of showing 
that physical implementations are reasonable and feasible. We consider the total number 
of processors, buffering of intermediate results, and internal communication, but only to the 
extent of showing that the requirements are reasonable. 

Construction of suitable hardware using today’s technology argues for a special-purpose 
VLSI chip, and Section VII presents one architecture suitable for this purpose. A primitive 
module consisting of two multipliers, one adder, and some registers is sufficient to support 
all of the computation required. Several such primitive modules may then be assembled, 
together with a suitable control structure, to produce a matrix-vector arithmetic module. 
These may then be connected into a network corresponding to the communication structure 
of the algorithm, and each module programmed by the host computer to execute the 
operation and communication sequencing necessary to implement the algorithm (or for that 
matter, any other high-speed straight-line matrix-vector computation). 

Finally, Section VIII will very briefly allude to, without discussing in depth, a few possible 
extensions to this work. The ability to efficiently pipeline implies that a number of considered 
variations on the same basic manipulator trajectory could be explored in parallel, and the 
one having the most satisfactory dynamical characteristics for actual execution chosen from 
among that set. It may even be possible to build an on-line “optimizing trajectory compiler” 
in which the desired motion (trajectory) for the next several time periods is pre-planned, 
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the motor torques automatically generated, and the time sequence inspected slightly before 
the manipulator has actually arrived at the trajectory points, thereby incorporating some 
dynamical considerations into trajectory planning. Poggio has a result indicating that neural 
structures could perform an arithmetic multiplication in about a millisecond, which implies 
that in principle it would be possible to compute the Inverse Dynamics in approximately real 
time using a suitable neural structure. We close with a few remarks concerning generalization 
of the O(log(n)) embedding to other recursive algorithms. 
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Table 0.1 — Comparison of Dynamics Formulations 
(adapted from Hollerbach[16]) 


0.1 a — Comparison of Time Depenc 

encies 

Method 

Multiplications 

Additions 

Uicker/Kahn 
(original Lagrangian) 

32£n 4 + 86&n 3 + 171}n 2 
+ 53 Jn— 128 

25n 4 + 66|n 3 + 129^n 2 
+42 Jn — 96 

Waters 

(partially recursive) 

106|n 2 + 620^n — 512 

82n 2 + 514n — 384 

Hollerbach 
(4x4 Lagrangian) 

830n - 592 

675n — 464 

Hollerbach 
(3x3 Lagrangian) 

412n — 277 

320n — 201 

Luh, Walker, Paul 
(Newton-Euler) 

150n — 48 

131n —48 

Horn, Raibert 
(table look-up) 

2 n 3 + n 2 

n 3 + n 2 + 2n 

Luh, Lin 

(scheduled parallel N.E.) 

57 n — 18 
(estimated — see text) 

50n — 18 
(estimated — see text) 

Lathrop 

(linear parallel N.E.) 

2n + 3 

fin + 7 

Lathrop 

(logarithmic parallel N.E.) 

2riog 2 (n +1)1 + 5 

6[log 2 (n + I)] + 10 

Lathrop 

(systolic pipeline) 

1 

(successive — see text) 

3 

(successive — see text) 


* This table reflects the algorithmically indicated cost for the fully general 6-link rotary manipulator, 
as in Hollerbach[16j. By considering special cases, introducing configuration or workspace assump¬ 
tions, or tailoring the computation, additional reductions are possible. See the discussion at the 
beginning of Section III. 



Table 0.1 — Comparison of Dynamics Formulations* 
(adapted from Hollerbach[16]) 


0.1b — Comparison for n = 6 

Method 

Multiplications 

Additions 

Uicker/Kahn 
(original Lagrangian) 

66,271 

51,548 

Waters 

(partially recursive) 

7,051 

5,652 

Hollerbach 
(4x4 Lagrangian) 

4,388 

3,586 

Hollerbach 
(3x3 Lagrangian) 

2,195 

1,719 

Luh, Walker, Paul 
(Newton-Euler) 

852 

738 

Horn, Raibert 
(table look-up) 

468 

264 

Luh, Lin 

(scheduled parallel N.E.) 

323 

(estimated) 

280 

(estimated) 


15 

43 

Lathrop 

(logarithmic parallel N.E.) 

ii 

28 

Lathrop 

(systolic pipeline) 

i 

(successive) 

3 

(successive) 


* This table reflects the algorithmically indicated cost for the fully general 6-link rotary manipulator, 
as in Hollerbach[16j. By considering special cases, introducing configuration or workspace assump¬ 
tions, or tailoring the computation, additional reductions are possible. See the discussion at the 
beginning of Section III. 
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1. INTRODUCTION 


In active articulated mechanisms, including both artificial and biological systems, the 
parameters which one typically can directly control are the forces (in translational joints, 
sometimes called prismatic) and torques (rotational joints, sometimes called revolute) applied 
by the actuators to the joints. Unfortunately, the parameters in which one is frequently 
interested are the linear and rotational accelerations (hence also, velocities and positions). 
This gives rise to two dual problems; both highly complex and non-linear, and both desirable 
to calculate in real time. 

The Direct (or Integral) Dynamics problem is to compute the mapping V from a set 
of applied joint forces and torques (r,, arising from stimulation of the actuators) into the 
resulting linear and rotational joint motions (accelerations $,): 

Computing such a mapping is equivalent to simulating the motion of the mechanism under 
the applied actuator effects. In this case one knows what one does to the thing, and wishes 
to find out where it will go in response. 

The Inverse Dynamics problem is to compute the inverse of the above mapping; given 
the accelerations desired, find the forces/torques necessary: 

Given that one knows where one wants the thing to go, what does one have to do to make 
it go there? This is the question that the Inverse Dynamics seeks to answer. “Where one 
wants it to go” is the desired trajectory, the manipulator configuration as a function of 
time. The configuration may be completely specified by the joint positions, so the trajectory 
may be given by stating each joint position as a function of time (i.e., q{t), where q is an 
n-dimensional vector giving the actual positions q, of the n joints). These functions are 
assumed to be twice time-differentiable to provide joint velocities and accelerations (?(t), 

9(f))- 

The question is usually posed by giving the joint positions and velocities (q(t 0 ), ?(fo)) 
which describe the state of the manipulator at a given point in time (say, t 0 ), together with 
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the joint accelerations which are desired at that point (ij(to))- The answer expected is the 
n motor torques (the n-dimensional vector r(t 0 ) giving the motor torques r,(t 0 ) at each joint 
i) which, if applied to the manipulator at that point in its state-space, would induce the 
desired accelerations. Typically one measures q and q, while q is supplied by a higher-level 
planning and control module. Position and velocity are the time integrals of acceleration, so 
acceleration control suffices, at least in a “mathematically exact” sense. There are a host of 
practical problems which insure that the model and the reality it models never quite match, 
and which we shall relegate to feedback. For control purposes the formalism provides a 
good first approximation to the “inverse plant”, which is close enough to render feedback 
correction feasible. 

Computing the motor torques is quite complicated, however, due to the high degree 
of non-linearity inherent in rigid-body rotational mechanics. The torques supplied must 
compensate for the inertia of the manipulator, gravitational force, the Coriolis and centrifugal 
forces, and viscous friction at the joints. Viscous frictional forces often depend only on q x 
and q, at jointhence they are susceptible to relatively simple correction and will therefore 
be ignored. All of the other terms vary in a non-linear fashion depending on the manipulator 
configuration at a given point in time; additionally, the Coriolis and centrifugal forces also 
depend on all pairwise products {q,q } , 1 < i,j < n) of joint velocities. 

This complicated computation has until recently posed a bottleneck in on-line con¬ 
trol of manipulators, and much effort has been expended in devising more time-efficient 
methods. Typical resonant frequencies of many mechanical manipulators is around 10Hz, 
so the computation must be repeated at about 60 Hz or faster[29]. Uicker[43] and Kahn[19] 
derived an early formulation for an n-link manipulator based on the Lagrange equations. 
This had an 0(n 4 ) time complexity and required 7.9 seconds on a PDP 11/45 to com¬ 
pute the torques for just one point in the trajectory[29]. This was too slow for on-line 
control. Efforts to improve this time have either explored other computational algorithms 
[44],[45],[35],[30],[29],[16], made simplifying assumptions [4],[35], or substituted table look¬ 
up for computation [38],[1],[2]. 

Since only the Coriolis and centrifugal terms involve pairwise products of all joint 
velocities, a common simplification is to just ignore them. Unfortunately this works well only 
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at low velocities where the product terms are small. During fast motion the Coriolis and 
centrifugal terms may dominate the computation [29] to such an extent that attempts to 
control the errors by feedback require excessive corrective torques[38]. 

Table look-up is in principle the fastest method of obtaining the necessary torques. 
If one could index a table of size 0(n 3 ) using the 3 n values of (g*, q it q it 1 < * < n), 
all computation could be replaced by memory references. This extreme was explored by 
Albus[1],[2], Raibert[37] reduced the table size by proposing a table indexed by position 
and velocity (with acceleration handled separately), and this was further refined by Raibert 
and Horn[38] to a table indexed only by position. Nonetheless for fine enough division of 
the table dimensions the size required remains enormous, filling the table and interpolating 
between stored values present problems, and the table is valid only for one particular load 
(e.g., mass of object grasped)[l6]. 

Two other formulations, not reflected in Table 0.1 because particularized to special 
cases and hence not directly comparable, deserve mention. Kane and Levinson[20] discuss 
a formalism (Kane’s Dynamics, originally developed for complex spacecraft control) based on 
generalized coordinates and velocities similar to the Lagrangian approach. The dynamical 
parameters can be represented explicitly so as to exploit simplifications arising from 
manipulator configuration or workspace constraints, though the computational complexity 
therefore reflects both configuration and workspace assumptions. They analyze the Stanford 
arm under the assumption that the workspace never requires the second joint to approach 
0 2 = o° or 9 2 — 180° (due to numerical instabilities there). Hollerbach and Sahar[18] present 
a method of merging the inverse kinematics with the inverse dynamics, particularized to the 
case of a robot with a spherical wrist. Many of the kinematic parameters generated in the 
inverse kinematics are needed in the inverse dynamics, and additional savings arise from 
the simplification of assuming a spherical wrist. Other special cases are discussed in the 
beginning of Section III. 

The most successful formulations of the general inverse dynamics involve recursive 
algorithms. Waters[44] first presented an 0{n 2 ) partial recursive form of the Lagrange 
equations, which was made fully linear recursive by Hollerbach[16]. Hollerbach also contains 
an overview of contrasting approaches to the inverse dynamics problem, to which the 
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interested reader is referred for further details. Orin et al.[34] first presented a linear 
recursive form of the Newton-Euler equations, which was refined by Luh et al.[29]. 

We have considered both the Newton-Euler and the Lagrangian formulations, in the 
form presented by Luh et al.[29] and Hollerbach[16]. Silver[40] has shown them to be 
fundamentally equivalent, differing mainly in that the representation of angular velocity in 
the Newton-Euler equations is more efficient. Reflecting this, the parallel formulations for 
both formalisms are found to require approximately equal parallel time to compute, but the 
Lagrangian formulation would require substantially more hardware to implement. This paper 
will thus present only the Newton-Euler results. Complete details for the Lagrangian case 
may be found in Lathrop[25]. 

The underlying intuition in both cases is the following. The motion (by which we will 
generally mean: position, velocity, and acceleration) of the manipulator base is assumed 
known, as is the motion of each individual joint. By beginning at the base and accounting 
for the (known, desired) joint motion at each joint, the motion of each successive link may 
be cascaded recursively from the base of the manipulator to the tip. Since the forces 
and torques applied by the environment to the last manipulator link (the tip, workhead, or 
end-effector) are known or measured, and the motion of the last link is known, the force 
or torque at the last joint necessary to drive the last link through its desired motion may 
be calculated directly from free-body rotational mechanics. This in turn allows calculation 
of the forces and torques transmitted across the last joint to the next inboard neighbor 
link, which in turn allows calculation of the next but last force or torque from rotational 
mechanics. Continuing in toward the base in this fashion, and accounting for the forces 
and torques passed across each joint, the force or torque necessary at each joint to drive 
each link through its desired motion may be calculated in linear time. The required motor 
force or torque at each joint is then the component of force or torque along or about the 
joint axis. Thus in linear time, the motor torques needed to support a desired motion may 
be computed. 

A number of practical attempts have been made to relieve the host computer of some 
of the computational burden associated with manipulator control. The principle idea is that 
machine management should be performed outside the main computer (CPU). A common 
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strategy is to use a single microprocessor to control the robot, and the host to control the 
controller, e.g. [11],[13], and [23]. This paper will not consider these further because the 
parallelism achieved is minimal, and they do not address arm dynamics. 

Most attempts to apply parallel processing to manipulator control have involved using 
microprocessors to servo individual joints, and have not attempted to include dynamical 
considerations. Typically, this involves the microprocessor as the active element in a joint 
control feedback loop with sensors to monitor the error (usually of joint position). Also, 
the individual joint servos for each joint axis are usually under the control of a master 
microprocessor, which coordinates their actions with commands from the host. In fact, 
usually the servo microprocessors will not communicate with each other directly at all. Shin 
and Malin[39] discuss one control strategy for this general approach. 

Acting under this general paradigm, Cook et al.[9] discuss a configuration of seven 
68000s and a programmable logic controller, designed for welding underwater pipelines. 
Kuo[24] describes an arm mounted on a mobile platform, having one microprocessor per 
motor and based on an AIM-65 system. Rafauli et al.[36] control a modified Unimate 2000, 
using one Intel 8748 to servo each axis based on positional feedback and a master composed 
of an iSBC 86/12A combined with an iSBC 337. Gupta[14] advocates several advantages 
of the MK68000 for similar applications. A single-board controller for a pneumatic drive 
arm, using one microprocessor at each axis with feedback from air pressure and speed 
as well as position, is presented by Goshorn[12]. Carlisle[8] additionally dedicates several 
microprocessors to sub-tasks such as sensor monitoring or I/O, though his interest leans 
somewhai more to computer numerically controlled machines. Distributed manipulator 
control schemes for servo-manipulators used in nuclear reactor maintenance are described 
by Besant[5] and Martin et al.[31 ]. The OSU Hexapod (an 18-degree-of-freedom, motor-driven 
walking machine) is controlled by an experimental multiprocessor consisting of five LSI-1 Is 
[21]. These are reconfigurable so that tree, star, and loop structures can be simulated. This 
multiprocessor has also been used for real-time optimization of leg tip forces, a task which 
is not strictly parallel in nature. 

Mudge and co-workers in several papers outline a scheme whereby the general purpose 
computer incorporates attached special-purpose processors for real-time numerically inten- 
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sive computations. The special processor proposed is a single chip implementation, which 
would interpolate between set points from the host and compute the correction torques for 
each joint of the robot arm, replacing their current PUMA control scheme of one LSI-11/02 
and six 6503s [26]. Though this particular application ignores dynamic coupling between 
joints, another paper [33] proposes a scheme to unify Resolved Motion, Gross Motion, and 
Fine Motion, based on the Newton-Euler formalism, suitable for implementation in real-time 
on their processor. The processor (called by them NP, the Numerical Processor) is described 
[32] as lying between Floating Point Systems’ AP120B (a high performance numerically 
oriented attached processor) and the Intel 8087 (a single chip numerically oriented attached 
processor in the Intel 8087 family). A similar approach is described by Turner et al.[42] 
involving a distributed processing architecture using three microcomputers in a pipelined 
configuration, also proposed for a broad class of advanced manipulator control algorithms. 

In the work most closely related to this paper, Luh and Lin describe a procedure 
for scheduling the sub-tasks of a group of microprocessors computing the Newton-Euler 
dynamics [27],[28]. One microprocessor is again assigned to each controlled joint axis; 
but in contrast to the servo-based approaches above, the microprocessors do communicate 
and arm dynamics are explicitly computed. Each microprocessor computes the recursion 
variables which correspond to its joint axis. Since these variables (as well as intermediate 
partiai resuiis) recursiveiy depend on eacn otner in various aifferent ways, often some 
microprocessor(s) will be idle while waiting for others to complete pending sub-tasks and 
idle time must be included in the schedule. It is a non-trivial scheduling problem to assign 
each sub-task of each microprocessor a specific execution sequence which minimizes the 
global computation time. Because the form of the equations (and hence the sub-tasks to be 
performed) for a joint differs depending on whether that joint is rotational or translational, in 
general each new manipulator configuration requires a new rescheduling of sub-tasks. 

The procedure adopted is a modified branch-and-bound search through the space 
of possible sub-task orderings, terminating when the minimum-time ordering has been 
found. Any feasible ordering which accomplishes all sub-tasks is first found, then refined 
by generating and comparing alternatives (other, partial, orderings from branch (choice) 
points). The estimated total time of a partial task ordering is its partial time so far (including 
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idle time), plus the time for all its remaining sub-tasks to complete if idle time is ignored. 
Since this is guaranteed to be an underestimate of true total time, branch-and-bound search 
applies. Each partial ordering is extended until either its estimated total time exceeds that 
of the minimum-time feasible ordering so far found, or it is extended to a complete feasible 
ordering of less total time and so becomes the new minimum. At the conclusion of this 
procedure, the path of minimum true total time is the optimum schedule. Though by its 
nature this procedure is not subject to precise analysis of the computational complexity of 
the resulting schedules, the authors report a concurrency factor of 2.64 on the Stanford 
arm. This estimated factor is used in Table 0.1. 
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Table 1.1 — Linear Recursive NewUnvEuier Formulation 
(after Luhet al.(2®]) 


Newton-Euler Backward Recursion: 


if joint s rotational; 

(*) 

Afwi-i if joints 


lAi’wj—i 


« tT _i +*<-!$, + w<_i x if joint s rotaMornl; 


(**) 


if joints 


f-Af j»i_j + in x Pi + «i x (w< X Pi) 




if joint s rotational; 


Pi 


(Afj»,-_, + w< x#* + Vi x (w< x p*) + Afai-iii + *•>< x Afar,-i 4. # joint * tranalational. 


r, = Vi X (wi X r*) 4- Vi X r- + 

Jt\ = m<r, '' 

2V< =* 7jWj + «< X (JiVi) [■ 

Newton-Euler Forward Recursion: 

fi*= Fi + Ai+tfi+i 

n% = Ai^ini^x + Ni 'H- •* X Fi + p* X (A*+i/,+i) 

f< ■* *,_i • lli 
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2. NOTATION 


The notation is based on that used by Hollerbach [16] and Luh et al. [29] in their analyses of 
the linear recursive inverse dynamics, which in turn derives from the Denavit and Hartenburg 
[10] convention for lower-pair chains. Coordinate systems are fixed in the body of each 
link and related rotationally by 3 x 3 matrix coordinate transforms and translationally by 
distinguished body-fixed position vectors. The notation 'vj is used to denote the vector vj 
referred to coordinate system 0i. 

It is worth emphasizing that, with few exceptions, ALL vectors in this paper are referred 
to link coordinates. This obviates the complexity and computational overhead of transforming 
everything to base coordinates. Elsewhere in the literature, °v } (denoting the vector vj 
referred to base coordinates) is usually abbreviated simply as vj. Since we will almost 
never refer any vector to other than its own link coordinates, we adopt instead the notation- 
simplifying convention that Vj abbreviates 3 v } , a vector referred to its own coordinate system 
0 ,. 

The links of a mechanism are numbered consecutively 1 to n from base to tip, with link 
0 denoting the base reference frame. There is another fictitious link n-f 1 attached to the tip 
when convenient, which may represent the object grasped or account for environmentally 
applied forces and torques at the workpiece. Attached to link i is a right-handed orthogonal 
coordinate system 0, with axes {x it yi,Zi). 

Joints (equivalently, hinges) occur between the links. Each joint is denoted by the 
number assigned to its distal (outboard) link, so that joint i connects links * — 1 and i. Joints 
may be either rotational or translational, but may have only one degree of freedom. Multiple 
degrees of freedom at a joint are modeled by introducing fictitious links having zero mass 
and length. 

For any two adjacent coordinate systems 0* and _i there is an orthonormal rotation 

matrix A, mapping vectors whose coordinates are referenced to 0, into the corresponding 
vectors referenced to 0,-1. Note that this is a pure rotation. The coordinate systems are 
located in the links so as to simplify the form of this matrix. The orthonormal basis vectors 
of 0i are arranged as follows: 
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z, lies along the positive axis of joint t -f 1. 

x, lies along the common normal from i to z,, 

y, = z, x x, completes the right-handed orthonormal system. 

Since the joints have but one degree of freedom, x„ z t , and *,_] are all body-fixed vectors 
in 0 t (i.e., have a fixed direction relative to the body-fixed coordinate system). The rotational 
orientation of two adjacent systems is completely described by 

a, is the angle between , and z t in a right-handed sense about x it 
e, is the angle between x and x, in a right-handed sense about z t -\. 

The vectors z,— x and z t being both fixed in 0,, a, is constant. Since z,—\ lies along the 
joint axis of joint i, and i,_i and x, are fixed in 0,_j and 0, respectively and both normal to 
0, measures the relative rotation about the joint axis between the two systems. Thus 
if joint i is rotational then 0, is the joint variable, otherwise 0, is constant (see Figure 11.1). 

By the preceding remarks, the rotation matrix A, corresponds to a coordinate rotation 
about x, by an angle a, (which aligns z, and z,— ,) followed by a rotation about the rotated 
z, (= z,—i) by an angle 0, (which aligns the other two pairs of basis vectors). If the first 
rotation is represented by the matrix <!>, and the second by 0,, then 


■10 0 
<3>, — 0 cos a, — sin a, 
lG sin a, cos a; j 
'cos 0 X — sin 0, O' 

0; = sin0; cos0; 0 

- 0 0 1 

A t = 0 t $, 


"cos 0 , — sin 0 t cos a, 

sin 0i cos 0i cos a< 

. 0 sin ai 


sin 0, sin a* ' 
— cos 0{ sin ati 
cos ai 


The translational orientation of two adjacent systems is completely described by 

o, is the distance between the origins of 0,_i and 0 t measured along x^ 

Si is the distance between x,—i and i, measured along *<_i. 

Since x i% z u and z,_i are fixed in 0,, n, is constant. z t —i lies along the joint axis of joint t, 
so if joint % is translational s, will be the joint variable, otherwise Si is constant. 
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The following link » vectors, referenced to 0,, permit a convenient specification of the 
translational relationship between adjacent coordinate systems 
* 

Pi a vector to origin 0, from 0*—i 

r i a vector to the center of mass of link t from 0i 

s , = p* -(- r*, a vector to the center of mass of link * from 0j_i. 

From the above remarks it is clear that 


SiAj 


+ a,x, 


but in virtue of the form of A, = 0,$, and its constituents, we have 


Q j z,-i = z,-i 
Ajz,^ = <J >JeJz t ~i 


which is body-fixed in 0,. Thus p* is composed of a part a,x, which is fixed in 0, and 
represents (constant) translation normal to both z,_i and z„ together with a part s,4>/z t _i 
whose direction is fixed in 0, and whose magnitude represents translation along z,_i referred 
to 0,. If joint i is rotational then s, is constant and so therefore is p*, else p* incorporates 
the result of translational joint motion at joint 

The rotation matrices A t may be cascaded by defining the rotation matrix ‘IV,-, which 
maps ; u m into ' u m . Since the inverse of an orthonormal matrix is equal to its transpose it 
follows that 

1 = { j Wi) T 


A+i 

•••A i 

if i < j 

AJ.. 

A t 

■ A j+ 1 

if i > j 


the superscript T denoting transpose in either matrices or vectors. 
From the above it is clear that 


°Wj = [(“*>), (°»>),(°*y)] 
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and 


'Wj = ( 0 Wj)°Wj 
= C'WfcX^i) 

= [( x j)) (»>), (*>)]• 

In the sections on logarithmic recursion it will be necessary to introduce another notation 
for consistency with the formalisms developed there. We will use W !>; to denote the mapping 
from Oj to Oi—u so that 

= - 1 w J . 

Also in the sections on logarithmic recursion we will slightly abuse the dot notation 
for vector time differentiation (i.e., d> nJ , and p u Elsewhere in the literature this denotes 
differentiation in the inertial frame, with differentiation in a rotating frame indicated by ' or 
*. We will take ALL terms u a , b to denote differentiation of u a ,b in 0„— This becomes 
equivalent to the standard notation at u 0lM the case of interest, and eliminates proliferation 
of ' or * superscripts in much the same spirit that taking v, eliminated superscripts of 0. 
This is explained more fully in Section IV. 
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Table 11.1 — Summary of Notation Used 
We define the following: 

m j the mass of link j, 

* 

r j a vector to the center of mass of link j from the origin 0 3 , 

* 

Pj a vector to the origin 0 3 from the origin 0 3 — 1 , 

"Q 

Pj accounts for gravitational acceleration, g if j = 0, else 0, 

* 

s .i = p* + r‘, a vector to the center of mass of link j from the origin 0 3 — i, 

< = r'j/mj, 

the angular velocity vector of link j, 

J} the inertial tensor (with respect to its center of mass) of link j , 

the joint generalized variable for joint j, 9 if rotational and s if translational, 
t j the joint generalized actuator force at joint j, torque if rotational and force if translational, 
iiie iuiai fuice (exeiuuiny gravity) on link j, 

N j the total torque on link j, 

fj constraint force (unknown) exerted on link j by link j — 1, 
n j constraint torque (unknown) exerted on link j by link j — 1, 

A? a pure rotation matrix mapping vectors in 0 3 into vectors in 0 3 — i, 

'W 3 a pure rotation matrix mapping vectors in 0 3 into vectors in 0 t , 

W i<3 = i ~ 1 W 3 , 

u a , b differentiated in 0 a -1 and referred to Ob- 
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S. PARALLELISM WITHIN A NODE 


This section will investigate the effect of exploiting parallelism within a node (joint), while 
remaining within the general linear recursive framework. As a conceptual aid in this section 
we assume that there is one group of parallel processors for each joint (node) on the forward 
and the backward recursion. However, since only one node is active in the computation 
of any given variable at any one step, and all nodes are identical, an implementation could 
be constructed using only one processor group by connecting the output back to the input 
through a buffer. Details of how this might actually be done are explored in Section VI. If 
only one processor group is used, computation of one set of joint torques must be completed 
before the next can begin. Otherwise, with one processor group for each joint (node), it is 
possible to systolically pipeline successive sets of joint torques at intervals of 1 Mult + 3 
Addns , or 4 Flops. 

The linear recursive structure of the Newton-Euler computations may be shown as a 
directed graph as in Figure 111.1. The essential structure of the algorithm can be clearly 
seen — acceleration is propagated outward (incorporating at each stage the next joint 
acceleration) and forces are thereafter propagated inward (allowing calculation of joint 
torque at each stage). Nodes represent the total processing associated with each joint in 
the forward or backward recursion, and directed arcs represent data dependencies. It is 
clear that reducing the computational time incurred at each joint (node) would imply a linear 
reduction in the total computational time (a constant factor speed-up). For reasons which 
will be explained in Section IV, a double subscript is used in Figure 111.1: thus (0 ,j) on 
the backward recursion, and {j,n) on the forward, denotes the variables output by node j. 
Non-systolic and systolic arrangements are indicated in Figures III.2 and 111.3. 

Tables 111.1 and 111.2 show the detailed internal structure of each node of the directed 
graph of Figure 111.1. The computation times given in Table ill. 1 reflect the times represented 
by the indicated operation. In general we follow Hollerbach[l6] in accepting the principle 
that economies in computation should be explicit in the formulation rather than implicit 
in the programming, and Table 0.1 reflects his analysis for the fully general 6-link rotary 
manipulator. It should be noted, however, that by “hand-tailoring” the computation to 
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account for special cases the computational cost can be frequently be reduced below that 
shown in Table 0.1. For example, if the particular manipulator configuration is such that 
some of the constant angles a, are multiples of §, then the factors cos a, and sin a, become 
0 and ±1 and multiplication by the rotation matrix A, can be reduced from 9 Mults and 
6 Addns to 4 Mults and 2 Addns. Since y, ■ p* — 0, vector cross products involving p* 
can be calculated with a special sub-routine in only 4 Mults and 2 Addns. Several other 
optimizations are possible, e.g. see [18],[20],[28],[29]. Many of these apply to the Stanford 
arm, which therefore has a substantially lower computational cost than the fully general 
case. Newton-Euler dynamics particularized to the Stanford arm requires only 308 Mults 
and 254 Addns [28], and Kane’s dynamics requires only 646 Mults and 394 Addns [20]. A 
rotary manipulator particularized to a spherical wrist and non-spinning base requires only 
448 Mults and 361 Addns [18]. Our analysis captures the fully general case with time bounds 
substantially below these, and in any event the intended implementation in VLSI argues 
strongly for a regular and systematic treatment which avoids all special cases. 

The times shown in Tables lll.l and III.2 reflect, for each variable, the rotational case 
only. The extension to the translational case follows directly in analogous fashion, and the 
time bounds stated remain almost exactly the same with only minor variation in the constant 
term. 

Note that, while on either the forward or backward recursion, the linear coefficient in 
the total time cost is determined by the time required to propagate the recursion variables 
through a single node. More specifically, this is the interval between the time that the 
recursion variables become available to one node and the time that they are made available 
to the next node. However, nothing constrains all of the variables to be made available 
at the same time. Since the different recursion variables are used at different times in the 
computation, relative delays are acceptable provided that each variable is available by the 
time that it is required in the computation. 

If the different variables become available to a node at staggered times (and in turn are 
made available to the next node at equally staggered times), the linear time coefficient is 
determined by the longest time required to propagate any single variable across the node. 
In determining this time only u it p £ , f it and n t need be considered, as the other variables 
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are not propagated down the recursive chain. 

Examining Table 111.1 for the Newton-Euler formulation, it is clear that (given the proper 
staggering constant offsets) no variable on either the forward or backward recursion requires 
longer than a matrix-vector multiplication and a vector addition to propagate through a node. 
This is evident because each propagated variable becomes available to the (t -f 1)— node 
one matrix-vector multiplication and one vector addition after it is made available to the *— 
node. Thus, the time required for the entire dynamics calculation may be reduced to 

2n • (MV + VA) + C 

where (MV) and (VA) are respectively the time required to perform a matrix-vector mul¬ 
tiplication and a vector addition, and C is a constant which accounts for initiating the 
calculation. 

This is the case even though each individual variable may take longer than (MV + VA) 
to compute. Table III.2 illustrates this arrangement graphically. The times when each 
of the variables become available are shown in Table III.2, as well as the intermediate 
partial results. For simplicity of presentation in Table III.2, a timing model is used in which 
1 Mult — 1 Addn == 1 Flop. This metric will also be used to establish the relative times at 
which various partial results are computed and made available. 

w, depends on nothing except w,_i and z,_ i?,, and requires time (MV-\-VA) to compute 
once these are available. It is clear that successive values of w* will become available at 
intervals of (MV + VA), since each depends upon nothing but its preceding recursive 
variable value and the input, w, depends on w,_i as well as on wf_i and However, 

w,—! and z,_i< 7 , are required in the computation well before w,_i. Given the availability of 
{w,} at intervals of (MV + VA), any w ; -dependent intermediate partial results required by 
w, can be calculated before w,-_i becomes available (by delaying wj if necessary, as will 
be seen below). These intermediate results already computed, when «,_] does become 
available it will be possible to compute w, in time (MV + VA). Meanwhile computational 
precursors to w,+i have been similarly calculated, so that when w* becomes available it is 
then possible to compute in time (MV -f- VA)\ and so forth. In this fashion, it can 
be seen that the availability of {w,} at intervals of (MV + VA] implies the availability of 


28 



{w,} at similar but offset intervals. By a similar reasoning process, both of these imply the 
availability of {p,} at intervals of {MV + VA). The structure and timing which realize this 
are shown in Table III.2. Similar considerations would hold for the forward recursion. 

The availability offsets and the constant C may both be determined from Table 111.1. This 
is done in Appendix A, where the appropriate offsets are shown to be 

Avail{u)i—\) > Avail(wi—i) + VC + VA (*) 

Avoj7(p,_ 1 ) > Avail(u,—i) + 2 VC + 3 VA 
Avail{p { _ j) > Avai7(w,_i) + VC + 2VA. 

Auai7(n,_|-i) > Avail{fi+i) VC + VA. (***) 

The three equations above, (*), (**), and (***), define the constant offsets or relative 
delays by which the propagation of the Newton-Euler recursion variables should be staggered 
in order to achieve the stated time bound. Note that if computation is allowed to proceed 
on a scheme whereby a node operates on its inputs as soon as the data becomes available, 
then these offsets will be naturally set up and maintained by the inherent computational 
delays shown in Table 111.1. 

Next we determine the constant C by showing when r t , the last generalized joint force 
of the forward recursion, becomes available as output. This may also be done from Table 
lil.1, and details are also shown in Appendix A. There it is shown that, assuming all input 
values become available simultaneously at time t = 0, the time required to calculate the 
Newton-Euler dynamics exploiting maximal linear parallelism is 

2n • (1 Mult + 3 Addns ) (5 MuIts + 9 Addns). 

Due to differences (not requiring additional computation by the host) in assumptions about 
the form of the input and output (not in the computation), this is slightly lower (by 2 Mults + 
2 Addns) than given in [25]. Specifically, we assume that the input buffer will deposit q, and 
q t as the third scalar coordinate behind two pre-stored scalar zeroes so as to make available 
the vectors 2 ,- 19 , and z l -\'q i directly, and that the output buffer will directly return the third 
scalar coordinate of /* (for translational joints) or n, (for rotational joints) as the indicated 
joint force or torque. This input/output convention avoids the algorithmically indicated cost 
of SV and VD without introducing special cases into the computational structure. 
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Table III. 1 — Relative Time of Linear Data Dependencies 


Linear Backward Newton-Euler Recursion 

Var. 

Waits OnJ 

Time at Step End 

Step 

Costf 

w, 

o = Avail(fjJi—\ ) 

Input 

Ti = z,_i qi 

* * * 

a + VA 

T 2 = u>i— i + T\ 

VA 

a + MV + VA 

w, = Aj T 2 

MV 

W, 

a = Avail( w,_i) 
b — Avail(d> t —i) 

Input 

T 3 = z t -iq t 

* * * 

a + VC 

Tj = w,_i X T\ 

VC 

a + VC + VA 

n = t 3 + t 4 

VA 

max( 6 , a -f- VC + VA) + VA 

7ti X 5 “j~ cj j —i 

VA 

max( 6 , a VC + VA) 

+ MV + VA 

u>i — A] T 6 

MV 


c = Avail( w,) 
d = Avail(ii>i) 
e — Avail(p t _i) 

c + VC 

T 7 = w, X p] 

VC 

c+2VC 

T$ = w, X TV 

VC 

d + VC 

To = U), X p* 

VC 

ma x(d + VC, c + 2VC) + VA 

Tio = Tg -f- T 9 

VA 

e + MV 

T n = Ajp,_ i 

MV 

max(e + M V , d -f- VC + VA, 
c + 2 VC + VA) + VA 

y- 

II 

c 

+ 

VA 

- 1 

ft 

c = Avail[oJi) 
d = Avail(d>,) 
f = Avails) 

c+VC 

T 12 = Wj X r, 

VC 

c+2VC 


VC 

d+VC 


VC 

ma x{d + VC,c + 2 VC) + VA 

EjKgSX^I 

VA 

ma x(f,d + VC + VA, 
c + 2 VC + VA) + VA 

■ 

■ 

VA 


| See end of table (continued next page). 

t “Avail(Xi—i) = t” means that variable is made available to the i— node at time t (on the 

backward recursion; substitute .XV-i-i on the forward recursion). 
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Table 111.1 — Relative Time of Linear Data Dependencies (continued) 


Linear Backward Newton-Euler Recursion (continued) 

Var. 

Waits On 

Time at Step End 

Step 

Costj 

Ft 

g = AvaiOji) 

ff + sy 


SV 

N t 

c = Avail(u) t ) 
d — Avail{<jj t ) 

c +MV 

Tie — 

MV 

c + MV + VC 


VC 

d+ MV 

Tie = J,w t 

MV 

max(c + VC, d) + MV -f- VA 

Ni = Tn + Ti % 

VA 


Linear Forward Newton-Euler Recursion 

Var. 

Waits On 

Time at Step End 


Costf 

/. 

h — Avail(f i-f-i) 

h+ MV 

HIBQBBBH 

MV 

h + MV + VA 

WEBB3EBKM 

VA 

n, 

h = Avail(f, + \) 
k = Avail(n, + i ) 

already computed * 

T20 — s i X F, 

(VC*) 

already computed * 

T' 2 \ = T20 + Nj 

(VA*) 

h+MV + VC 

T22 — P, X Tm 

VC 

h + MV + VC + VA 

T23 — T21 -f- T22 

VA 

k + MV 

T24 = -A,+in t+ i 

MV 

max(/c, h + VC + VA) 
+MV + VA 

n, = T23 + T24 

VA 


* (that is, computable before the recursion reaches the node, except at the initial node) 


| VA = time cost of Vector Addition 
VC — time cost of Vector Cross product 
SV = time cost of Scalar multiplication of a Vector 
MV =time cost of Matrix multiplication of a Vector 
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Table III.2a — Timing of Linear Newton-Euler Backward Recursion 
Timing of w, and w*; n = 4, rotational joints 

CJg “ W base* ^0 == &base 

(For simplicity here, 1 Mult = 1 Addn — 1 Flop) 



20 _ 
21 _ 
22 _ 

23 _ 

24 _ 

25 _ 


32 



Table III.2a — Timing of Linear Newton-Euler Backward Recursion (continued) 

Timing of p,; n = 4, rotational joints 

Po Phase 

(For simplicity here, 1 Mult = lAddn = 1 Flop) 



0 _ 
1 _ 

2 _ 


3 _ 



24 _ 

25 _ 
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Table III.2a — Timing of Linear Newton-Euler Backward Recursion (continued) 
Timing of r„ F t , N, \ n = 4, rotational joints 

(For simplicity here, 1 Mult = 1 Addn = 1 Flop) 
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Table III.2b — Timing of Linear Newton-Euler Forward Recursion 
Timing of /„ n,; n = 4, rotational Joints 

(/$ == /tip) ^5 = r^tip) == [0, 0, 1] • ttj 

(For simplicity here, 1 Mult = 1 Addn = 1 F7op) 


Time 


21 _ 


41 _ 

42 _ 

43 _ 

44 _ 

45 _ 

46 _ 


fi = -At+i/i+i 

+F t 


22 _ 


— 

n, = 




23 _ 




+P t x (At 

+l/<+l) 

24 _ 

MV 



+ N, + a* 

X F, 


25 _ 


_ A r ./s 





26 _ 

VA 

_/< 

VC 



— 

27 _ 




_ s\ X F 4 

VC 


28 _ 

MV 


VA 

_+n 4 ^ 


_ j 

29 _ 


_ A 4 } 4 



VA 


30 _ 

VA 

_ fz 

VC 



— 

31 __ 




_ s* X F 3 

VC 


32 _ 

MV 


VA 

_+^3- 


_ 1 

33 _ 


_ A 3 f 3 



VA 


34 _ 

VA 

_ fi 

VC 



_ 

35 _ 




_ S 2 X F 2 

VC 


36 _ 

MV 


VA 

_+iV 2 - 


i-i 

37 _ 


_ Aifi 



VA 

C 

38 _ 

VA 

_/i 

VC 



— 

39 _ 




_ s* X Fi 

VC 


40 _ 



VA 

[_+n 4 - 


r- J 


MV 


VA 


p ;xa 4 / 4 MV 


VA 


VA 


MV 


VA 


MV 


VA 


Asn$ 

n 4 


A 4 n 4 

«3 


, A 3 n 3 
n 2 


A 3 ri2 
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(“ Bkwdi " = backward recursion processor for joint t, “Find" = forward) 



Figure 111.1 — Linear Recursive Graph Structure 
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(Time Intervals 1 (MV + VA) per complete set of joint torques) 

(“ Bkwd = backward recursion processor for joint t, "Fwd," = forward) 
Figure III.3 — Systolic Pipelined Process, n = 4 
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4. PARALLELISM EXPLOITING LOGARITHMIC RECURSION 


The preceding section showed that parallelism potentially leads to considerable time savings 
even within a linear recursive framework. This section will show that the linear time 
dependency can be improved. By a suitable restructuring of the basic computational 
framework within which the nodes are embedded, together with a corresponding revision to 
the recursive forms of the equations, an O(log(n)) total time may be achieved. For convenient 
reference, the formulae are collected in Table IV.1. 

The basic intuition is illustrated in Figures IV.1.a and b, applied to n consecutive 
multiplications. If these are performed serially, as in Figure IV. 1 .a, then time 0(n) is required. 
By processing in parallel (Figure IV.1 .b), time O(log(n)) may be achieved. It is easy to see 
that general recursive calculations of the form 

Xi = a,i,_ i -f- b, 


may be performed in time O(log(n)). 

To exploit this potential in the inverse dynamics case, however, it is necessary to 
generalize the linear recursive equations. The linear form may be regarded as an operator 
© which maps a variable (X,— i) representing the base through (»'— 1)— inputs, together 
with th*» »— innut IT \ into (X ropriagAntinn the hag® through ?— inputs; 

Xi-! © Ii ~ Xi. 

For logarithmic recursion, an operator ® is required which maps a variable (X a ,k) repre¬ 
senting the a— through k l -E inputs, together with (X( fc+ i) it ,) from the (k -f-1)— through b^ 
inputs, into (X a<t> ) representing inputs a through 6: 

X a 'k ® ),6 * * X ai b. 

The linear recursive equation is a special case of the more general logarithmic form in which 
a = o, k — i — 1, b = k + l = i, X it , = /,, and computation proceeds serially, so that 

Xi = X 0 ,i 

— Xo t (i—i) ® Xij 

= X,_ 1 ©/,. 
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This is the meaning of the double subscripts in Figure 111.1. 

What physical meaning can be assigned to this? We will return to this question during 
the analysis below, where the discussion can be illustrated more clearly by reference to 
“real” physical quantities. For the present, however, the recursive forms (both linear and 
logarithmic) may be thought of as mechanisms for relating physical parameters at one 
coordinate system (or link or joint) to physical parameters at another, by abstracting away 
the intervening links of the physical manipulator. Very loosely speaking, the goal in both 
cases is to relate the acceleration of each link to the acceleration of the base by abstracting 
away the intervening joint accelerations, then to relate the distal forces and torques acting 
on each link to the distal forces and torques acting on the tip by abstracting away the 
intervening joint forces and torques. This done, the joint forces and torques necessary to 
sustain the desired acceleration may be found from a purely local application of Newtonian 
mechanics. We will use the term “relational parameter” to refer to a quantity which relates 
physical parameters at one coordinate system to physical parameters at another. 

The linear and the logarithmic recursive forms differ principally in how they approach 
the problem of relating physical parameters between coordinate systems (or links or joints). 
The linear form relates the base (backward recursion) parameters to the first link parameters 
to obtain relational parameters of the form Xi (== X 0 .i)- These in turn are related to the 
second link parameters to obtain relational parameters of the form X 2 (= X Q , 2 ), which relate 
the second link to the base. In sequence, the relational parameters X 0 , 3l X 0 , 4 , ..., X 0 ,„, are 
formed, which relate respectively the third, the fourth, and the n— links to the base. The 
process may be viewed as one which, at each step, “glues” the next link parameter onto 
the current relational parameters to produce the next relational parameter, thereby relating 
the base to the next successive link. 

In contrast, the logarithmic recursive form may be viewed as a mechanism for “gluing 
together” any two adjacent relational parameters. Parameters of the form X t , t reflect only 
the input values at link (joint)»; those of the form X itJ reflect (abstract away) links »through 
j. At the first step, adjacent pairs of (backward recursion) relational parameters of the form 
X 2t , 2 i, X 2t+u2i+l are related in parallel to form the relational parameters X 2 i , 2 t+ i. Thus on 
the first step we relate alternating pairs of adjacent links to form the relational parameters 
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-X’o.i. X 2 , 3 , -X' 4 , 5 , ..., (if n odd). At each j& succeeding step, all adjacent pairs of 

the form X a>k , X k+Ub are related in parallel to form the relational parameters X a , 6> where 
a = 2>m, Jfc = a + 2>~ 2 , k < b < a + 2 J — l, and 0 < m < n/( 2 J ). On the second step, 

-X”o, 2 , X 0 , 3 , X 4 , 6) X 4>7 .X„_ 3 >n _i, X n - 3 ,n are additionally formed; on the third step we 

also pick up Xo )4 , Xo,s, Xo, 6 > Xo ( 7 , Xg ( i 2 .X n — 7 ,n— 1 , X n _ 7 , n ; and so forth. This process 

is illustrated in schematic in Figure IV .2 (only the backward recursion is shown). There, 
n = 7, so the backward recursion would take three steps as shown. It is easy to see that in 
riog 2 (n +1)] steps, all (backward recursion) relational parameters of the form X 0rt , 0 < * < n, 
may be formed. But as noted above, these are just the linear recursive variables X t . Thus 
the backward recursion may be performed in real time in 0 (log 2 (n + 1 )) steps using parallel 
computation. The same is true of the forward recursion, hence of the Inverse Dynamics 
computation. 

The logarithmic recursion operator must possess the following recursive properties: 

(a) X atb must be computable only from inputs a through b, 

(b) X a>b = X a<k ® X (A+1 ),(, must be computable from variables of the 
form Y a . k or Y(a + i ).6 or previously computed Z a , b or non-recursive values, 
and 


(c) X 0 ,i on the backward recursion, or X,, n on the forward recursion, 
must be equal to the value of the linear recursive variable Xi. 

These properties will allow the use of a structure analogous to Figure IV.2. 

The remainder of this section will be devoted to the derivation of appropriate logarithmic 
recursive formulae. Separate sets of formulae will be developed for the forward arsd the 
backward recursion variables. As in Section III, it is necessary to consider only those 
variables which are propagated the length of the recursive path. For the Newton-Euler 
formulation these are w,, d>i, f it and n { . These formulae are collected in Table IV.1. 

Suppose one wished to find the value of non-propagated values, for example to find r m 
for node m on the forward recursion path. One first computes w 0 ,Tn.u'o,m. and Po,m using the 
logarithmic forms given in Table IV.1. By the remarks above, these are exactly w m , and 
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p,„ of Table 1.1; and the time needed here to compute all three is O(log(m)). From these, r m 
may be computed in constant time 0{c) using the formulae in Tables 1.1 and 111.1. 

In the derivations below, for each linear recursive variable, our general strategy will be 
as follows: 


(a) propose a closed-form, non-recursive formula which is equivalent 
to the linear recursive formula of Table 1.1, 

(b) show that (a) is correct by showing that it is a fixed point of the 
linear recursive formula considered, and holds for the zero term (this is 
equivalent to an inductive proof), 

(c) propose a non-linear recursive formula which is equivalent to (a) 
at a — 0 and b = i, and 

(d) show that the resulting combining operator <g> is correct by 
showing that it preserves the form of the formula in (c). 

The reader may verify that for a = 0 , k — i — 1, b = k -)-1 = i the formulae reduce to Table 
1 . 1 . 

It will be necessary to introduce several auxiliary variables not directly corresponding to 
any variable in the linear recursive formalism. There we will be concerned to show only that 
the asserted combining operator is correct. 

in the following we assume a < k < 6 throughout, the case of a = 6 being found as a 
special case of (c) above. In order to cover both the rotational and translational cases, it is 
convenient to introduce 

( 1 , joint i rotational, 

0 , joint t translational. 

Bi = 1 — Ci 
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NEWTON-EULER BACKWARD RECURSION VARIABLES: 


Introduce the auxiliary variable W a ,i,, which will represent products of the coordinate 
matrices A,. Notation sometimes used elsewhere in the literature is 1 W>,, but we write W a ,b 
here for consistency with other variables. Let 


= n a > 


J *=a 

/ fr 


iu n a ’ 

V./=n /\J=(fr+l) 

W a , k Wik+l),b 


It can readily be seen that W, Ik b maps the coordinates of a vector expressed in the system 0 t , 
into coordinates expressed in C a - 1 . The physical significance of the combining form (g> is 
to compose a mapping which relates 0 f , and 0 k with a mapping which relates Ok and 0 „—1 
in order to produce a mapping which relates the coordinate systems Ob and 0„_j. Thus 
W a , a = A a and W„,„_j = I (the identity matrix). 

In the case of W a , b , the combining operator (g) is matrix multiplication and 


W a ,k ® = Wa,kWk + \,b 


Hereafter we will merely display the combining form without drawing explicit attention to the 
operator <g), acknowledging implicitly that X a<h = X a ,k ®X k +i,b- 

Next consider the angular velocity w : . We define z { — i) to be the axis of rotation of the 
base system and q 0 to be its magnitude, so that w 0 = A% z ( _i)<j 0 . This will be non-zero if, for 
example, it is desired to include the Earth’s rotation in the calculations (perhaps for satellite 
applications). If one creates a fictitious frame 0<_i) at the Earth’s center, then z ( _i) points 
through the North Pole and q 0 is equal to the Earth’s angular velocity. 

We show that w, satisfies the following closed-form non-recursive formula 

i 

Ui = Wj'iCjZj- 
i- o 
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because this is a fixed-point of the recursive formula for w, given in Table 1.1. 

t —1 

=^[(^2 W ],x-i a 3 z 3-i<h + 

J=0 

= Aj (w,_! + a.Zi-igi) 

In order to produce an identical form at A = 0 and b = define 

b 

U a ,b = ^2 Wj^jZj^q, 
j~a 

k b 

= Y1 { w 3^ w (k+\),b) <r., z .,-\qj + ^2 W 3^ cr 3 z j~ii> 

j=a j — k +1 

— W( k+l )' h Ua.k + W( t + i),t 

This is the combining form for w a , b - 

uj a , b expresses the angular velocity of Ob relative to 0 a —u referred to 0 b . Thus oj ui is 
the angular velocity of 0, relative to the base frame, and w 0 ., also accounts for the rotation 
of the base frame (if qo = 0 then these are the same). The combining form for u a , b is a 
means of composing the angular velocity u k+ub of 0 b relative to 0 k with w a , fc , that of 0 k 
relative to 0 u _i, in order to obtain the angular velocity of 0 b relative to In particular, 

ui a , a —\ = 0. The rotation matrix Wl +l h in the formula transforms u> a , k from O k into O b , the 
system to which is referred. 

Similar remarks apply to the physical significance of u a ,b, p a ,b< etc - Derivations of the 
other propagated recursion formulae, shown in Table IV.1, are given in Appendix B. Note 
that w 0 is the angular velocity of the base, non-zero if the Earth’s rotation is modeled as 
in some satellite applications. Also, p 0 is the acceleration of the base. Typically this is the 
acceleration due to gravity at the site. If one took w 0 7 ^ 0 then p 0 may also include a term for 
x (w 0 x pj), where p* Q is a vector from the Earth’s center 0(_i) to the site; this accounts 
for the centripetal acceleration arising from the rotation of the Earth. One may account for 
gravitational acceleration by taking p 9 0 = g, else pf = 0 for i ^ 0 . The term involving p? is a 
technical artifice to account for p 0 cleanly, and vanishes in the combining form. 

As mentioned in Section II on notation, we slightly abuse the dot notation for time 
differentiation (e.g., u a , b ) to indicate the time differentiation of u a , b from within the coordinate 
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system O a — j • This is equivalent to the standard notation at a = 0 (the case of interest), and 
results in a less bulky notation. To illustrate this usage, we consider an alternate derivation 
of u a .b, where £ denotes time differentiation in 0 C , and c u a , b denotes u a>b referred to 0 C . 


u a ,b 


dt 

«f(°-!) 
dt 


Ua,b 


{ bu a,k + 


= h &a,k + 


d^-V, 

dt 


Wfc+1,6 


d k 


W k _^ l b U at k + ^a,k X Wfc-j-1,6 — Wk + 1,6 

VVj +1-(l W„,A. -|- (Wfc + lih W a .fc) X W/r + l.i, + Wfr-H.h 


which is our combining form. 


NEWTON-EULER FORWARD RECURSION VARIABLES 


On the Newton-Euler forward recursion, the coordinate matrix products of interest will 
be W a+1 . A+1 instead of W 1 k+Ull . This is because we wish to transform from Oa + i to 0 a , 
instead of from 0 k to 0 b - Intuitively speaking, we are now working from the tip of the 
manipulator back toward the base. Hence we are desirous of transforming some relational 
parameter -XV+,. ( „ which abstracts the subchain from coordinate system (or link or joint) 
k -4-1 to b and is expressed in 0<, , into the equivalent quantity expressed in D-. The X,. , lit 
so transformed can be combined with X aik , representing the subchain between a and k 
expressed in 0 a , to yield X a , b expressed in 0 a . We will here assume that the necessary 
products W 0 +i,k+i are generated on the forward recursion in accord with the formulae 
above. This requires a minor abuse of notation, as the combining form would then be 

Wo+l.H-l — Wa+1,*-H ® VFt-|-2,6-1-1 

but we have agreed that our combining forms will be 

If the reader finds this troublesome, we suggest that she or he make the substitutions 
o' = o + l, 6' = 6+l, and k' = k + 1. This done, one need only remember that 

Wo',a' = ■^o+l* 


45 



rS 


Tabla IV.1 -lafaiiiMc Wecoratv Fwiillllont 


Logarithmic Backward Rocuraion: 


( l , joint a rotational, 

0 , joint a franalatto na l. 

a 0 = l — 

w a , b »{ 

if«#ft. 


Wo,6 


*(•—1)4» 

W r ^ + i )>t) w 0 ,* -f 


if a* ft; 
if a ^ ft. 


: .-r> 


&ajh 


*{«—1 

Wfk+I),**** + (** r ?fc+t),i w o^) X w (fc-i-iL» 4- If « ^ ft. 


Co,4 


ifa—ft; 

^ r tfc+l),*CB,k *f C(*+l),t if*jftft. 



















r~x Table IV. 1 — Logarithmic Recursive Formulations (continued) 

Logarithmic Backward Recursion (continued): 


K 

Ho,k =| 


if o «■ fc; 
fc + Jl(*+|),» if O b. 


Wfl,o X Pa 

if a = i; ; 

- 

Sa t b~' 



W(k+l),b S °,k + (W?fc+l ),»«<*,k) X #(fc-H).fc + S {k+l),b 

if ajLb. 


PS + W«,« Xp‘+ W„,„ X (<*>„,„ X P*) + 9 a AZ* a -tia + 

2w«,e X 9«,a 

if • =* 4; 

Po,t =' Wfk+l),bPa,k + P(fc+l),t "MR^fc+i ),*<•>«>) X %+l),6 




+(W r ^4.x),fcW 0 ,fc) X 


,k) X rt(k+i)^ 4* 2($<k+i 



if a ^ b 


Logarithmic Forward Recursion: 


F m 


fa,b ! 


(N. + txF. 

Oo,fc 

l»o,* + W*+M-H 


if o = 6; 





*«». 6; 
iAm) X /(k-4-l)t*^ K a fi t. 


47 










Figure IV.1b — Serial Vs. Parallel Multiplication (Parallel) 
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X<*£*X*. H*' 


v 


X»,rX, 


♦ A 




Figure IV.2 — Logarithmic Recursive Graph Structure 
(Only backward recursion shown) 
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5. OPTIMIZED LOGARITHMIC RECURSION 


Because here the extension to the translational case may not be entirely obvious, the equa¬ 
tions and analyses presented cover both rotational and translational joints in manipulators 
composed of mixed systems. Analogously to Section III, as a conceptual aid we assume that 
there is one group of parallel processors for each node shown in Figure IV.2, and also one 
group for each node in the similar forward recursion. However, only one row (tier) is active 
in the computation at any one step, and all tiers are identical. Thus an implementation could 
be constructed using only one physical device for each pair of joints of the manipulator, 
by connecting the outputs back to the inputs through a buffer. Implementation details are 
expanded further in Section VI. The comments about the 4 Flop systolic pipeline interval, 
found in the introduction to Section III, also apply here. 

The detailed internal structure of a node is presented in Table V.1, which is analogous to 
Table 111.1. Note that the formulae presented in Table IV.1 for the forward recursion require 
calculating VV a+] .f, + i and R n , h again, exactly as was done on the backward recursion. These 
have no interesting data dependencies, do not bound the computation, and are computed 
exactly the same on the forward as on the backward recursion — thus for conciseness they 
are shown only once in Table V.1. 

Deiay conditions for the iogaritnmic case proceed as shown in Appendix C, similarly 10 
Section III. 

Since it is possible to satisfy the minimum delay conditions, propagation occurs at a 
rate of (MV + VA) per node. Assuming that the delay conditions are satisfied and that all 
input becomes available simultaneously, it can readily be seen that 

Avail{X a , b ) = Avail(X a<a ) + jlog 2 (i> - a + 1)](MV -f VA). 

Thus in particular, if X t is the linear recursive variable corresponding to X a ,b, then X t — X 0 ,i 
so 

Avail{Xi) = Avail(Xo,i) 

= Avail(X a<a ) + [log 2 (t + 1)](MV + VA). 

Assuming a maximally parallel implementation as before, the total time of the calculations 
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is shown in Appendix C to be 


2[log a (n + 1)1(1 Mutt + 3 Addn«) + 5 M 







Table V.1 — Relative Time of Logarithmic Data Dependencies 
Logarithmic Recursive Forms; 0^6 


Logarithmic Backward Newton-Euler 

Recursion 

Var. 

Waits OnJ 

Time at Step End 

Step 

Costf 

■m 

a — Avail(W Xt y) 

a MM 

w a , h = w a , k w k + 1,6 

MM 

W a ,fc 

a = Avail{W Xt y) 
b = Avail(uj Xt y) 

max(a, 6) + MV 



max(a, 6) + MV + VA 


VA 

w„. h 

a = Avail(W x , v ) 
b — Avatl(uij v/ ) 
c — Avail(d/j' V ) 

max(a, c) -f MV 

T i = Wl +uh w a , k 

MV 

max(a, b) + MV + VC 

— T\ X 

VC 

max(a -|- MV -f- VC, 

b + MV + VC, c) + VA 

T.) = T;j -f- 

VA 

max(a -+~ VC VA, 
b + VC + VA, 
c) + MV + VA 

w a ,t = T 2 + T a 

VA 

Ra.b 

a — Avail(W x , y ) 
d — Avail(R x<y ) 

inax(a, d) -f MV 

T s = Wl + U h Ra.k 

MV 

max(a, d) -)- MV -+- VA 

Ra,b = T5 -(- + 

VA 

Sa.b 

a = AvaiHW Zl y) 
b — Avail{<jj T _y) 
d — AvaU(K Xt y) 

e — Avail(Sj'y) 

max(a -j- MV, b -f- MV, 
d) -f VC 

.C 

+ 

X 

e*h 

li 

VC 

max(a, e) MV 

7 7 — W 

MV 

max(a + MV + VC, 
b+MV + VC, 
d + VC, e) + VA 

Tg = Te + ^-(-1,6 

VA 

max(a -f- MV -{-VCA~VA, 
b + MV -f- VC + VA, 
d + VC + VA, 
e + MV) + VA 

Sa,6 — T 7 + ^8 

VA 


a = Avail(W x .y) 
f = Avail(Q X 'y) 

max(a, /) 4 - MV 


MV 

ma x{a,f) + MV + VA 

< 2 a,fc ~ Tg + Qk-+1,6 

VA 


f See end of continued table, 
t X x<y = both of X a , k and X fc+ i , 6 



















Table V.1 — Relative Time of Logarithmic Data Dependencies (continued) 
Logarithmic Recursive Forms; a ^ 6 


Logarithmic Backward Newton-Euler Recursion 


Waits On$ 

Time at Step End 

Step 

Dostf 

a = Avail(W X)V ) 

max(a -f MV, c + MV, d)-\-VC 

Tio — T 2 X R k +i,b 

vc 

b — Avail(uj It y) 

max(e, /) + V A 

T\\ = + Qk+l,b 

VA 

c = Avail(u> x<y ) 

max(e, f) + VA + SV 

T] 2 = 2 Tn 

SV 

d = Avail(R X 'y ) 
e = Avail(Sx.ti) 
f = Avail(Q x , y ) 

max(a + MV + VC, b + MV -f VC, 
d -f VC, e + VA + SV, 
f + VA + SV) + VA 

T\z = T$ + T 12 

VA 

g = Avail(p x y ) 

max(a + MV + VC, b + MV + VC, 
d + VC, e + VA + SV, 
f + VA + SV) + VC + VA 

T 1a = T, X Ti 3 

VC 


max(a + MV + 2 VC + VA, 
b -f MV + 2 VC -f VA, 
c 4 - MV + VC, d + 2VC + VA, 
c + VC + 2 VA + SV, 
f + VC + 2 VA + SV) + VA 

T 15 = Tio + Th 

VA 


max(a, g) + MV 

T\a — W/--m .fcPa .*■ 

MV 


max(a + MV -f 2 VC + 2 VA, 
b + MV + 2 VC + 2 VA, 
c + MV + VC + VA,d+ 2 VC + 2 VA, 
e +VC + 3VA+SV, 

/ + VC + 3VA + SV, g ) + VA 

T\- = Pjfc + 1,6 + ^15 

VA 


max(a + MV + 2 VC -f 3VA, 
b -f MV + 2 VC + 3 VA, 
c + MV + VC + 2VA, d + 2 VC + 3VA, 
e +VC + 4VA + SV, 
f + VC + 4VA + SV, g + MV) + VA 

Pa,6 = ^16 + ?17 

VA 


t See end of continued table, 
t X x , y = both of X a , k and X k+i , b 
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Table V.1 — Relative Time of Logarithmic Data Dependencies (continued) 
Logarithmic Recursive Forms; 


Logarithmic Forward Newton-Euler Recursion 

Var, 

Waits On| 

Time at Step End 

Step 

Costf 

fa,b 

h = Avail(f Xt y) 

h + MV 


MV 

h + MV + VA 

fa, b = }a,k + Tjg 

VA 

,6 

h = Avail(f x<v ) 
i — Avail{n JtV ) 

(already computed *) 


{MV*) 

h + VC 

T-n = T'20 X /fr + 1,6 

VC 

h + MV + VC 


MV 

max(i, h + MV + VC) + VA 

T 23 = n a , k + T22 

VA 

i + MV 

T 24 = W a+1 . k+ m k + 1 , b 

MV 

max(i, h + VC + VA) 

+MV + VA 

n a ,b = T23 + T24 

VA 


* (that is, computable before the recursion reaches the node, except at the initial node) 

X X x<y = both of X a<k and X k+Ub 


f VA = time cost of Vector Addition 
VC — time cost of Vector Cross product 
SV — time cost of Scalar multiplication of a Vector 
M V =time cost of Matrix multiplication of a Vector 
MM =time cost of Matrix Multiplication 
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6. IMPLEMENTATION CONSIDERATIONS 


As noted earlier, the main thrust of this paper has been an exploration of the extent to which 
parallelism can be pushed in this particular domain. Nonetheless, we hope to indicate that 
the physical requirements are not particularly excessive and thus this might be a reasonable 
thing to actually implement. This section contains a fairly general discussion of hardware 
requirements. Chip architecture is discussed in the next section. 

Except perhaps for some variation in the constant term, the hardware sketched in 
this section is intended to capture the maximally parallel nature of the algorithms and to 
physically attain the stated time bounds. It is always possible to deliberately sacrifice speed 
and gain material economy by allowing hardware multiplexing in a partially parallel system. 

The description of the algorithms, the data dependencies, and the timing, have been 
developed quite generally in terms of matrix and vector operations. One can easily imagine 
many different ways of implementing matrix arithmetic, however. The purpose of this section 
will be to imagine these in sufficient detail to conclude that the algorithms proposed are 
reasonable. We also imagine that any actual physical implementation will differ in many 
particular details from those presented here. That is, we expect that the details of the 
arrangements and requirements which we display will change, but not by so much as to 
icnuei me yeneia! eunoiusiuim inapplicable. 

Much that we will observe, however, can be seen as a consequence of the macro¬ 
scopic computational structure of the algorithm itself. Where possible we will conduct the 
discussion at a level of matrix-vector arithmetic whose operators are expressed in differing 
implementations, and we will seek to discover regularities which constrain the design across 
different implementations. 

We will generally worry about three main concerns which commonly dog parallel systems: 

(a) internal buffering and storage of intermediate results, 

(b) communication and bussing of intermediate results, and 

(c) number of processors. 

We will propose a general computational structure, then use that to consider requirements 



for internal buffering and storage. This will lead us to the communication and bussing 
necessary to get the data to the appropriate storage locations. Finally we consider the 
processing required. 

We accept several intuitions commonly held in VLSI, including: as dimensions scale 
downward complexity scales upward, which leads us to seek simplicity and regularity; and 
as dimensions scale downward processing power becomes cheap and communication, 
bussing, and buffering become the limiting factors. We are for now willing to assume 
global communication (in particular, a global clock and a global reset signal). This avoids 
many problems of timing and handshaking protocol. We further assume that timewise, 
1 Mult = 1 Addn = 1 Flop. This substantially simplifies timing, and would not be an 
unreasonable simplification to impose on a (synchronous) physical implementation anyway. 
In the discussion of buffering in particular, we will use this metric (together with the relative 
offsets and more general timing developed in Sections III and V) to establish the cases in 
which buffering is or is not required. 

Note that the logarithmic formalism as developed in the text above required a separate 
handling of the a = b case. This would certainly be the way one would build it from 
multipliers and adders in order to achieve the tightest possible time bound. However, if the 
most likely implementation of the logarithmic algorithm will be in VLSI or WSI it is here more 
interesting to devise a single regular, systematic structure which uniformly handles both the 
a = b and the cases. Since a = b only happens once each direction we are quite 
happy to purchase simplicity and regularity at the price of an increase in the constant term. 
Appendix D presents a technical artifice by which we can make the a = b case look like 

a 5 , 

The conceptual structure of the computation was shown in Figure 111.1 (for the linear 
case) and Figure IV.2 (for the logarithmic case, with only the backward recursion shown). 
The processor nodes could be hooked up in a regular, regularly extensible structure very 
similar to Figures 111.1 and IV.2. This would permit global pipelining of successive sets of joint 
torques at intervals of 4 Flops. However, note that only one node (linear) or tier (logarithmic) 
is ever active at any given point in the computation. We could therefore map our full 
conceptual structure into a much smaller one, which buffers intermediate results in such a 
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way as to use only one physical processor node (linear) or tier of nodes (logarithmic). This 
would use much less hardware, but would imply that computation of one set of joint torques 
must complete before the next could begin. The basic notion (which will be expanded 
more fully in the discussion below) is shown in Figure VI.la, for conceptual orientation. The 
backward and forward recursion nodes are never simultaneously active, so processors could 
be shared between them; see Figure VI.1b. The choice of architecture will depend heavily 
on whether it is desired to systolically pipeline successive sets of joints torques at 4 Flop 
intervals, or simply to calculate one set. 

BUFFERING 

As shown in Figure VI.1 b, there are four sets of buffers and communication pathways with 
which we will concern ourselves: input buffering, intra-node buffering, inter-node buffering, 
and backward-forward recursion buffering. These may be roughly divided into internal and 
external buffers. From input/output considerations we can put immediate bounds on the 
amount of buffering external to a node which may be required (this is input, inter-node, 
and backward-forward recursion buffering). There are, after all, only so many variables 
to transmit. The potential problem lies in the internal buffering within any one processor 
node, for we have created a whole host of intermediate partial computations which have 
complicated, intertwined data dependencies. The potentially massive buffering required for 
these is actually very tightly bounded. 

Input buffering requires storing the scalar values for *,•_!<?„ l'g,, and A { , as well as 

the manipulator configuration parameters (p*, pf , r‘, a*, z t —i, m it and Ji) and the endpoint 
values (w 0 - w 0 , p 0 > fn+ 1 and n n+1 ), for a total of about 37n-f 15 scalar values. Buffering 
within any one node in the linear case requires at most 15 scalar values of temporary storage 
since (referring to Table 111.1) only w it T 8 , T 13 , 7i 5 and T 17 are not used immediately, but this 
requirement can be met by simply latching the output of the sub-processors into registers 
(as in the next section). This requirement is only 3 scalars in the logarithmic case (for Ti), 
and it can also be met by latching. Buffering between nodes is not needed in the linear 
case, and requires about 54 scalars per node in the logarithmic case (if outputs are latched, 
this can nearly be halved). Backward-forward recursion buffering involves at most only the 
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passed recursion variables from each joint. (Additional intermediate storage is needed if 
a systolic pipelined architecture is implemented, by about a factor of n, linear, or [log 2 n], 
logarithmic.) Buffering requirements are considered in more detail in [25]. 

COMMUNICATION 

We do not wish to propose a detailed communication protocol, only to show that the 
requirements are sufficiently bounded that such a protocol could be devised. To develop 
this, we shall discuss communication in terms of "wires” which are vector busses three 
scalars wide, and not allow multiplexing of busses. In a real implementation, data on 
the "vector busses three scalars wide” might be transmitted serially as three sequential 
numbers over only a single real wire. Also in a real implementation, the busses might 
be time multiplexed. Thus in the discussion below, the reader should supply her or his 
own scale factor depending on her or his own implementation image. After developing 
communication models for Figures VI.1b, we will show how the systolic pipelined structures 
of Figures 111.1 and IV.2 have a regular, regularly extensible communication network. 

As in the question of buffering, communication falls into that internal and external to a 
node. The internal communications may be treated as comprising a fixed-size hard-wired 
module in which the “wires” are laid down as dictated by the algorithm. The external 
communication must carry exiernaiiy buffered data back and forth between the process 
nodes and the buffers. 

Bounds may be placed on the internal communication by noting the following: 

(a) as shown below (when number of processors is considered), none 
of the linear forward or backward recursion, the logarithmic forward or 
backward recursion, or the non-propagated, variables have more than 25 
matrix-vector operators, 

(b) no operator has more than two operands, and therefore 

(c) no single module of the above (a) has more than 50 point-to-point 
busses. 

If the forward and backward recursion share processors then this must be increased to a 
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maximum of 35 operators, hence 70 busses. The non-propagated variables have no more 
than 10 operators, hence 20 busses, and these must be included somewhere as noted 
above. We stress that this is an upper bound based only on the total number of processors, 
and in any real implementation the actual number would doubtless be much smaller. This is 
because many intermediate results are used by only one operator, so we expect that many 
operands could be transmitted by abutment, by simply connecting one operator’s output 
directly to the input of the next operator. This will, however, depend on the particular 
implementation layout geometry chosen. The essential point is that there are easily few 
enough wires to route on a point-to-point basis. 

For external communication we consider the paths to and from the buffers separately. 
We essentially argue that since the external buffering requirements were not excessive, the 
communication required to support those requirements will not be excessive. No inter-node 
buffering was required in the linear case, but the variables themselves must be transmitted 
— this requires four vectors on the backward recursion and two on the forward. The 
logarithmic case required buffering at most 18 vectors (54 scalars) per node per (MV -f VA) 
cycle (both backward and forward recursion), hence at most 18 vector busses. The input 
values (18 scalars per node) and the endpoint values (w {) , etc. — 12 scalars at the first 
node of the backward recursion and 6 scalars at the first node of the forward) must be 
communicated. Finaiiy, eacn node must eventually communicate f, and N t to the backward- 
forward recursion buffers. 

The systolic pipelined structure of Figures 111.1 and IV.2 has particularly nice scaling 
properties as n increases. These structures do not fold the nodes or tiers together as does 
Figure VI.la or b. In the logarithmic case, the reader should compare Figure IV.2 with 
Figures VI.2 and VI.3. All show almost identical structure, but in different fashion. Figure 
VI.2 shows Figure IV.2 expanded to include the forward recursion, and Figure VI.3 maps this 
into a regular rectilinear array. This array is also regularly extensible as shown in Figure 
IV.4. In all cases, circles are processor nodes which implement one complete node of the 
logarithmic algorithm. Additionally, in Figure VI.3 squares represent passive buffers which 
do no more than perform buffering for the variables transmitted from node to node, or from 
backward to forward recursion. In the linear case the regular, regularly extensible array 
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structure can be clearly seen from Figure 111.1. Passive buffering is only required between 
backward and forward recursion processors. (We assume that the non-propagated variables 
are computed from the propagated variables within the circles as appropriate, in the manner 
discussed above). All variables pertaining to the same time slice are thus calculated in the 
same node (linear) or tier (logarithmic), and progress node by node (tier by tier) until they 
finally emerge as the desired torques. Each node or tier requires [MV + VA) — 4 Flops to 
complete, so if different successive input values were presented at intervals of 4 Flops at the 
bottom, corresponding motor torques would emerge from the top at intervals of 4 Flops. It 
seems likely that the speed would be bounded by the input/output requirements of the host 
system. 

The thing to notice about the linear case in Figure 111.1 is that it is possible to restrict the 
total width of the busses between successive nodes to the width required for the inter-node 
communication of any single node, and as argued above this is finite and small. In the 
logarithmic case (Figure VI.3), the maximum total width of the busses may be restricted to 
twice this. Thus as the structure is scaled upward (i.e. as n becomes arbitrarily large), the 
amount of area consumed by busses remains a constant fraction of the total area. 

NUMBER OF PROCESSORS 

Consider next the number of processors required. Tables 111.1 and V.1 were constructed 
to represent exactly each operation at a node exactly once (except that as noted in Section 
V, W z , y and R x , y are calculated in the same way on both the logarithmic backward and 
forward recursions, so for conciseness were shown only on the backward — here they must 
be counted in the forward recursion). 

Since the time denotation given in Tables 111.1 and V.1 [MV, VA, VC, etc.) also 
denotes the operation, we see that to propagate the recursive variables through a single 
node requires 
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Requirements Per Node — Propagated Variables Only 

Algorithm 

MV 

VA 

VC 

SV 

MM 

Total 

Linear Bkwd. N.E. 

3 

5 

4 



12 

Linear Fwd. N.E. 

2 

4 

2 



8 


Logarithmic Bkwd. N.E. 

6 

12 

4 

1 

1 

24 

Logarithmic Fwd. N.E. 

5 

4 

1 

j 

1 

11 


Additionally, one must also compute the non-propagated variables (r;, F t , N it and r, from 
Table 111.1), which is the same for both the linear and the logarithmic cases. 


Requirements Per Node — Non-Propagated Variables Only 

Algorithm 

MV 

VA 

VC 

SV 

MM 

Total 

Backward Recursion 

2 

3 

4 

1 


10 

Forward Recursion 






0 


The total per-node requirements are thus 


Requirements Per Node 

— Tota 

Is 

Algorithm 

MV 

VA 

VC 

SV 

MM 

Total 

Linear Bkwd. N.E. 

5 

8 

8 

1 


22 

Linear Fwd. N.E. 

2 

4 

2 



8 


Logarithmic Bkwd. N.E. 

8 

15 

8 

2 

1 

34 

Logarithmic Fwd. N.E. 

5 

4 

1 


1 

11 


An interesting optimization is possible in the logarithmic case if it is not required to 
model the rotation of the earth and if n is even. Recall from Section VI that the recursion 
was grounded in 0(_i), and that we considered z(—i) as a vector pointing through the North 
Pole, etc. This was primarily done to provide a firm and unambiguous grounding for the 
mathematical analysis, and in many cases may not be required. The logarithmic algorithm 
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requires one processor node for each pair (or fraction thereof) of input nodes. If n is 
odd then [(n/2)] = [(n + l)/2] and the number of processor nodes is (n-f1)/2 regardless. 
However, if n is even then [(n/2)] = [(n-f l)/2] — 1 , and by grounding the recursion in O 0 
we may eliminate one processor node. Except perhaps in satellite applications, typically this 
loses nothing interesting anyway. 

An implementation might involve a special-purpose VLSI chip capable of handling general 
vector arithmetic up to and including matrix-vector multiplication. (This is chosen to be 
midway between two alternatives — a matrix-matrix chip would be larger and could be 
implemented using three of the proposed chips, while a vector-vector chip would be smaller 
but three could implement the chip proposed.) The chip would incorporate 18 registers (for 
storing three pairs of 3-vectors), 6 multipliers and 3 adders. It would also need a means of 
sequencing these operations, as well as a means of decoding (perhaps from jumpers wired 
to the pins) which particular vector operation to perform. 

A natural candidate is a datapath chip, as in Barrett et al.[3]. As shown in Figure VI.5, it 
consists of a Programmable Logic Array (PLA) which decodes micro-instructions to produce 
control signals driving computational and storage elements attached to several common data 
busses. By controlling when the storage elements read and write which busses, and when 
and from which busses the computational elements obtain their operands and output their 
roQnitQ th« natijro nf tho computation performed on-chip is controlled. Thus a datapath chip 
functions as an easily-customizable micro-CPU. This will be explored more fully in the next 
section. 

The reader can gain some intuitive feel for an implementation by turning to Table 111.2, 
and imagining that the pages represent printed-circuit boards and the indicated operations 
represent chips. The structure so composed would implement the fully systolic pipelined 
linear architecture for n — 4. As suggested in the next section, several such Vector 
Arithmetic Modular Processors (VAMPs) might be put on a single package, depending on 
fabrication and processing technology, so the actual chip count may be lower. To avoid 
these dependencies, the discussion following will be conducted in terms of VAMPs, each 
capable of computing one vector result, and three together of computing a matrix-matrix 
multiplication. 
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Below, “Single Device” (single node, linear case, or tier, logarithmic) is a non-systolic 
configuration. “Systolic Pipeline” in the logarithmic case requires only seven nodes total on 
each of the forward and backward recursions, in light of the optimization discussed above 
for n = 6 applied to Figure IV.2, plus six more if the a — b case is handled as in Appendix 
D. The algorithms would require the following number of VAMPs 


VAMPs* — Propagated Variables Only (n = 6) 

--1 

Algorithm 

Per Node 

Tier, n = 6 

Single Device 

Systolic Pipeline 

Linear Bkwd. 

12 

* 

12 

72 

Linear Fwd. 

8 

* 

8 

48 

— 

Logarithmic Bkwd. 

26 

78 

78 

338 

Logarithmic Fwd. 

13 

39 

39 

169 


* Vector Arithmetic Modular Processors — see also next section 


VAMPs — Non-Propagated Variables Only (n = 6) 

Algorithm 

Per Node 

Tier, n = 6 

Single Device 

Systolic Pipeline 

Linear Bkwd. 

10 

* 

10 

60 

Linear Fwd. 

0 

* 

0 

0 


Logarithmic Bkwd. 

10 

20 

20 

60 

Logarithmic Fwd. 

0 

0 

0 

0 


The total requirements are thus 


VAM 

Ps — Totals, n — 6 

Algorithm 

Per Node 

Tier, n = 6 

Single Device 


Linear Bkwd. 

22 

- 1 

* 

22 

132 

Linear Fwd. 

8 

* 

8 

48 

Logarithmic Bkwd. 

36 

98 

98 

398 

Logarithmic Fwd. 

13 

39 

39 

169 
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Figure VI.la — Non-Systolic, Processors Shared, All Joints 



Output 

Torques 


Figure VI.1b — Non-Systolic, Forward and Backward Recursion Processors Also Shared 
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Figure VI.3 — WSI Communication Structure of Logarithmic Recursion 

(Rectilinear Array) 
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7. A ROBOT CHIP 


Having seen that the global physical requirements are not excessive, it becomes interesting to 
look more closely at VLSI chip structures capable of physically implementing the computation. 
It is clear from the discussion above that we require a chip capable of performing general 
matrix-vector arithmetic. Within this section we investigate the computation and control 
architecture supporting this requirement. We will sketch one architecture suitable for our 
purposes, though of course there are others. As in the previous section we imagine that 
any actual physical implementation will differ in many particular details from those presented 
here, but not by so much as to render the general conclusions inapplicable. 

Obviously such a chip will be more broadly applicable than just the inverse dynamics 
computation, however, and we will seek to maintain a high degree of flexibility in our 
implementation. In particular, we will seek to insure that: 

(a) all chips used be identical, or at least interchangeable, so that 
only one chip-type is required, 

(b) the particular operation which any chip performs be programmable, 

(c) the length of the basic time cycle, and the points during the cycle 
when operands are read or the result written to busses, and which busses, 
be programmable, 

(d) the host computer be able to dynamically reprogram (b) and (c) 
at will by simply writing the appropriate values to the n input buffers, in 
exactly the way data is written to the device, and 

(e) the control flexibility of (a) through (d) must not use any com¬ 
munication wires or buffering other than those already required to support 
the computation and discussed in the previous section. 

Later in the section we will consider modifications to the bussing scheme discussed in the 
preceding section. By relaxing (e), we will sketch an architecture whereby 

(f) the sources of the operands which any particular chip accepts are 
programmable (as in (d)), and 
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(g) a limited error-correcting capability may be incorporated, so 
that correct computation can often be automatically continued following 
individual chip failures. 

It should be explicitly observed, however, that the architecture presented is incapable of 
matrix inversion or similar operations (e.g., Gaussian elimination). It would also be useful 
to incorporate some of the common trigonometric functions and inverses, as well as square 
root. These capabilities are required for many interesting applications, but require much 
more complex control and condition testing than we need for basic matrix-vector arithmetic. 

The basic unit of most matrix-vector arithmetic is the vector dot (inner) product. Matrix- 
vector multiplication is accomplished by forming the dot product of each of the rows of the 
matrix with the vector, and matrix-matrix multiplication by forming the dot product of each 
row of the first matrix with each column of the second. Other familiar operations (vector 
cross (outer) product, vector addition, etc.) may be performed with different sequencing of 
the basic dot product hardware. 

Our strategy will be to compose a primitive module capable of computing one coordinate 
of the result (i.e., one vector dot product). If we then group several such primitive 
modules together, along with suitable control, we can implement all necessary matrix-vector 
operations. The number of primitive modules per chip is a design decision; the control 
mechanisms we develop support different choices. Nine primitive modules are sufficient 
for a matrix-matrix multiplication, and three for a matrix-vector multiplication. The basic 
architecture established, we display control sequences implementing the various operations. 
Next we embed the mechanism in a timing structure governing when in the cycle operands 
are read and the result computed and written. The internal structure explicated, we discuss 
how the control information might be loaded under programmed control from the host. 
Finally, we indicate how the bus implementation of chip interconnect might be extended 
to allow different algorithms to be dynamically programmed, and a limited error-correcting 
capability. 

As shown in Figure VI.5, a datapath[3] is a bus-oriented architecture composed of 
stacked computational elements, busses running through them, and a centralized control. 
Typically a primitive cell, one bit wide, is replicated in the horizontal direction to the 
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width (in bits) of the datapath, yielding one computational element (such as a register 
or adder). For floating-point arithmetic this is modified slightly so that the mantissa and 
the exponent circuitry are replicated separately. Different computational elements are then 
stacked vertically to form the datapath. One or more busses run vertically through each cell, 
and these may be connected by abutment to vertically adjacent elements forming vertical 
global busses. Control lines run horizontally through each cell, control being frequently 
generated by a Programmable Logic Array (PLA) on the side. 

For datapath elements we assume the following: registers, adders, and multipliers. 
(Other elements, such as comparators and dividers, would also be needed to implement 
matrix inversion or Gaussian elimination.) Each element “talks” to both busses and to both 
vertically adjacent (abutting) neighbors. Thus each element can load operands and dump 
results to and from the busses and adjacent neighbors. Additionally, in order to facilitate 
serial communication between chips and minimize wires, we assume that the registers are 
shift registers, and can shift operands in or results out. We observed while developing upper 
bounds in the preceding section that the "vector busses three scalars wide” might actually 
be implemented as a single multiplexed wire in order to minimize wire-count. "Operand” 
below will typically mean the three scalar values which make up a vector, unless context 
clearly indicates otherwise. (All data transfers within a module are done in parallel, of 
course.' 

Thus the basic cycle will be: 

(a) start cycle, 

(b) load first operand from off-module, a certain delay after the cycle 
starts, 

(c) load second operand from off-module, a second certain delay after 
cycle-start, 

(d) compute result, 

(e) dump result to off-module, another certain delay after cycle-start, 

(f) delay until cycle ends, another certain delay after cycle-start, and 

(g) go to (a). 

As noted above, the basic cycle length of the inverse dynamics algorithm considered herein 
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is MV 4 - VA — 4 Flops, though other algorithms will have other basic cycles. 

Consider the floor-plan for a primitive module, shown as a block diagram in Figure VII.1. 
Assuming that the operands have been properly loaded, it is easy to see that it can compute 
one coordinate of any of the matrix-vector operations listed at the end of Tables 111.1 and V.l. 
(Note that to enable the vector cross product (VC), the two multipliers are placed between 
the two register pairs NOT corresponding to the result component calculated by the primitive 
module.) The sequences of data transfers and computations required for each are given 
in Table VII. 1 . Only two global busses are required, and that the number of multipliers has 
been reduced from three to two (a consequence of the 1 Mult = 1 Addn = 1 Flop timing 
assumption, as noted in the previous section). 

Several of these primitive modules will operate concurrently to produce a matrix or 
vector result; since our vectors are in £ 3 , this number will normally be a multiple of three. 
As in the previous section, we will assume groups of three primitive modules. Figure 
VII.2 shows a floor-plan block diagram, together with control and (multiplexed) off-module 
communication (not shown is the ROM or RAM program storage). Such a device would 
be capable of one matrix-vector multiplication, or the three coordinates of any supported 
vector-vector operation. For purposes of discussion, we will take this Vector Arithmetic 
Modular Processor (VAMP) as our unit of computation. 

The VAMP operand inputs are wired (perhaps through intermediate delay buffers as 
discussed in the preceding section) to the result output of the VAMP computing that 
operand. By synchronizing between source and destination VAMPs the respective dump 
and load delays after cycle-start (recall that we are willing to assume a global clock and 
reset), intermediate results can be passed to successive operators. 

Serial off-module communication between VAMPs might permit several VAMPs to share 
the same package by substantially reducing pin-out. (The trade-off is that it will increase 
communication time, perhaps requiring the insertion of additional delays.) Though the 
physical size is heavily dependent on choice of process technology, width of the data word 
in bits, and other design parameters, we can make some initial estimates of total package 
count based solely on pin-out (with the understanding that these will likely be revised 
upwards depending on design parameters and the particular fabrication process). In most 
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process technologies, each VAMP would require external connections to Power, Ground, 
Clock-1, Clock-2, Global-Reset, Operand-A, Operand-B, and Result. If each VAMP is 
wired separately, a nominal 40-pin package would provide pin-out sufficient for five VAMPs 
to support these eight signals. (Due to process defects, of course, they must be individually 
tested and faulty VAMPs discarded.) If a single Clock input is converted to Clock-1 and 
Clock-2 (or more if needed, e.g. Pre-Charge) on-chip and distributed to each VAMP, and 
if Power, Ground, and Global-Reset are also brought in on one pin each and distributed, 
then a nominal 40-pin package would provide pin-out sufficient for 12 VAMPs. As was seen 
in the preceding section, 11 such 40-pin packages would implement the full systolic pipeline 
for the linear case of n = 6. 

Alternatively the other extreme can be chosen, and each VAMP packaged individually. 
Each VAMP could be put in an 8-pin package, resulting in very small cheap chips. This 
would have the further advantage of not requiring innovative packaging. 

The control must specify the delays associated with communication and computation 
of the result, and which operation to perform. This is shown schematically in Figure VII.3. 
The delays are easily implemented by counters and comparators. A Cycle-Counter is zeroed 
at each cycle-start and incremented at each Flop. The delay associated with each register 
ioad/dump is compared to the Cycle-Counter and the register loaded on equality. By 
insuring mat cycie-start occurs synchronously on aii cmps, communication timing between 
operand sources and destinations can be coordinated programmatically. In turn, in virtue of 
accepting a global clock signal we can insure simultaneous global cycle-start (e.g., perhaps 
by bringing Global-Reset to True for one clock period only at the beginning of each cycle). 
This globally resets counters to zero, and insures that accidentally dropping a bit or missing 
a clock-tick does not throw a module permanently out of step. 

It is sufficient if the device is able to execute operations from only a small fixed repertoire, 
viz. those at the end of Tables 111.1 and V.l. This being the case, the microcode instructions 
needed for each item in the repertoire may be stored on-module in Read-Only-Memory 
(ROM). The operation to be performed can then be specified as a pointer into ROM, e.g. 
as the high-order address bits or as an index pointer. 

If we were designing a very general machine we would certainly want the microcode 
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to be loaded under program control from the host. Though such capability is not needed 
for most matrix-vector algorithms, we observe that Random-Access-Memory (RAM) could 
be loaded in much the same way as we will load control information. The pointer into 
ROM would then be augmented with an indicator bit selecting RAM or ROM, and all would 
proceed as sketched for the ROM-only case. 

Thus we see that communication and control can be effected by loading three registers 
with delays for the comparators and one register with a ROM pointer. These four values will 
be referred to below as the module’s "program”. Next, consider how these control registers 
might be loaded under programmed control from the host. (For a prototype version one 
might simply bring the control in from off-chip, each control register bit corresponding to a 
pin wired by jumpers to either power or ground.) 

Since the same wires are to be used for both data and control, the chips must have 
a way to distinguish which is which. Though it would be possible to tag each word with a 
control bit, it is preferable to use the Global-Reset signal. In normal operation (Data Mode) 
Global-Reset is (for example) brought True for exactly one clock-tick to mark cycle-start, 
then immediately returned to False. If instead it is kept True, all chips enter Program Mode. 
While in Program Mode they will use the same wires in much the same way, but treat the 
words as program data rather than computational data. In Program Mode, Global-Reset 
is (for example) brought False for exactly one clock-tick to mark program-cycle-start, then 
immediately returned to True. 

The system must be programmable from any conceivable system state, in particular from 
random or partially programmed states. Thus the delay registers must be assumed invalid. 
All transfers occur immediately following program-cycle-start. The host writes successive 
program data words to the input ports and these are clocked through the network. Each 
module will clock in program data as operands, operate on it as discussed below, and clock 
out program data as result. When the module clocks in the program data containing its 
own module program, that data will be loaded into its control register(s) as below. After 
all VAMPs have been programmed in this fashion, Global-Reset can be brought and kept 
low forcing Data Mode. Thereafter normal data computation can occur through the fully 
programmed network. 
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Thus we must insure that every VAMP is directly or indirectly accessible to the host, 
and that each VAMP can recognize its own program when seen. The first condition is 
satisfied by the connected directed acyclic graph nature of the network; any module not at 
least indirectly accessible to the host could never receive operands and hence could not 
participate in the computation. 

One way to allow each module to recognize its own program is to assign to each one 
a unique number, “burned in” as in Programmable-Read-Only-Memory (PROM) before the 
chip is ever inserted into the network (a prototype might simply have jumpers wired to pins). 
The host then simply writes pairs of (VAMP-number, VAMP-program) to the network inputs, 
and these are clocked through the network. When a module recognizes its own number 
in Operand-A (i.e. the first operand), it loads the associated program. For example, the 
VAMP-number might be in the first scalar component of the vector, the VAMP-program 
contained in the remaining two scalars. Variations on this scheme could also allow RAM to 
be loaded, perhaps using the second scalar component as an address and the third as the 
microcode word to store there. The VAMP outputs Operand-A (unchanged) as its Result. 
The network can be programmed by writing, to each input port, one such number-program 
pair for each module in the network. Chips are interchangeable simply by changing the 
VAMP-numbers transmitted by the host. 

We conclude this section by very briefly sketching extensions permitting a limited 
dynamically programmed communication system, and a limited error-correcting capability. 
Neither of these are required for the Inverse Dynamics algorithm, but would render the 
device more useful. 

The communication requirements of parallel algorithms are often mostly local, with a 
few long-distance data transfers which must also be supported. This is the case in the 
Inverse Dynamics as well. The Mostly Local Bus (MLB) in Figure VII.4 is intended to 
support both high local band-width and sparse but necessary long-distance communication. 
It can be used by the host to gain programmatic control of the communication network 
implemented by the array of processors. Programmatic host control of the implemented 
communication structure (the data dependency graph), together with host control of the 
operation performed at each node, allows the same physical device to efficiently implement 
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many different algorithms without physical reconfiguration or rewiring. 

Refer to Figure VII.4 (MLB). It depicts a multi-tiered bus, with some of the tiers composed 
of many short busses, some tiers composed of several medium-length busses, and some tiers 
composed of a very few long busses. Conceptually, imagine that each module is potentially 
connected to each bus directly in front of and back of it. (In any real implementation, small 
groups of modules would actually share long-distance drivers to keep the area penalty low 
for sparse long-distance traffic.) A communication network structure can be imposed by 
specifying when each module reads or writes which bus. By insuring that the destination 
module reads the same bus of the same tier at the same time the source module writes to 
that bus, any two modules which connect to the same MLB can communicate. Different 
source/destination pairs can share the same bus provided the communications occur at 
different times in the basic time cycle of the algorithm. Within the framework we have 
developed this means adjoining, to the registers governing at what time busses are read 
and written, additional registers governing which tier is targetted. These registers would be 
loaded exactly as already described. 

In those algorithms susceptible to this architecture, most communciation will be local 
and can occur on the many short-interval busses, achieving high local bandwidth. The few 
long-distance lines exist to support sparse necessary long-distance communication, but if 
used too heavily the system performance degrades. This degradation can be made graceful, 
however, by inserting extra length into the basic cycle. This creates extra slots into which 
conflicting communications could be transferred. Note that this communciation structure is 
most suitable for algorithms having a straight-line systolic nature, such as characterizes the 
Inverse Dynamics. In cases where it is applicable, it permits relay-free communication with 
limited pin-out or bus connections, and would be most suited to WSI or systems with large 
numbers of processors. The relay-free character of the communication differs from schemes 
based on twisted n-cubes, which may require up to O(logn) relay delays between any two 
processors. 

It would be possible to “tune” the MLB to the communication requirements of a particular 
algorithm (while preserving the capability to dynamically reconfigure so as to perform others) 
by adjusting the ratios of bus lengths in successive tiers. For illustrative purposes, Figure 
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VII.4 shows the number of processors served by each bus doubling at every second tier, with 
the even and odd tiers offset from each other so that each processor is roughly symmetric 
in communicative power each direction. The number of tiers thus grows as 0(log 2 ) of the 
number of processors. To tune the MLB to a particular algorithm, one would require that 
the number of busses of a given length was roughly proportional to the fraction of messages 
communicated a distance of approximately that length. This rough guide must be refined 
further if the average message length varies with distance. 

Finally, we note that most of the machinery necessary to support a limited error correcting 
capability has been developed. This will permit several cases of gross individual chip failures 
to be caught and flagged, with the device automatically resuming correct computation 
following error correction. There are hardware error classes for which this capability will 
not apply, of course, some examples being a direct short between power and ground (it is 
difficult to automatically recover from this in any case), occassional intermittent faults, or 
failures in the error-correcting circuitry itself. Since the error-correcting circuitry is a small 
fraction of the total chip circuitry, however, the probability that it will fail first is likely also 
to be small. 

Order the VAMPs, so that each has a unique successor and a unique predecessor 
except the two end VAMPs. Basically we suggest a mechanism in which each VAMP checks 
its successor after having been checked by its predecessor. It thereafter plays the role of 
its successor in the computation while its successor checks the next VAMP, and so forth. 
After each VAMP has been checked they return to a normal configuration, and repeat the 
process. If a faulty computation should be detected in the checking process, the faulty 
VAMP will be inhibited by its predecessor and excised from the computational structure of 
the device. Its predecessor will continue to play its role in the computation until the faulty 
component can be replaced. 

In addition to those discussed above, each VAMP would also have a second set of 
inputs Operand-A', Operand-B', inputs for Test-in, Test-back-in and Inhibit-in, outputs 
for Test-out, Test-back-out and Inhibit-out, and input/output Result'. Each VAMP’s 
Operand-A' and Operand-B' inputs are connected to the Operand-A and Operand-B 
inputs, and its Result' input/output connected to the Result output, of its successor. Its Test- 
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in and Inhibit-in inputs are connected to the Test-out and Inhibit-out of its predecessor. Its 
Test-back-in input is connected to the Test-back-out output of its successor. See Figure 
VII.5. 

These new inputs and outputs are used as follows. During normal operation of a 
VAMP, Test-in, Test-back-in, Inhibit-in, Test-out, Test-back-out and Inhibit-out will 
all be False. The VAMP will load as its operands the inputs Operand-A and Operand-B, 
and dump its result on the output Result. The input/output Result' will be tri-stated. 
Eventually, the VAMP’s predecessor will bring Test-in to True at the start of a cycle. The 
predecessor is now filling the VAMP’s computational role, and the VAMP is free to check its 
successor. It does this by loading its operands for that cycle from the inputs Operand-A' 
and Operand-B', performing the same calculation as its successor, and comparing its result 
to its successor’s output, which is loaded through the VAMP’s input/output Result'. (To 
do this, of course, it must have a second set of control registers, correctly loaded to match 
those of its successor.) The output Result is tri-stated. 

If the two results do not match, then the successor is assumed to be faulty (recall that 
the VAMP itself has just been checked in the immediately preceding cycle). The VAMP 
excises its successor from the computation by bringing Inhibit-out to True. This causes 
its successor to tri-state its data outputs and bring its control outputs False, which is 
implemented by very simple gate logic at each pad driven directly from Inhibit-in. The 
VAMP will continue to fill its successor’s role in the computation until the device is brought 
down for maintenance. At that time it can broadcast its VAMP-number through the net in 
Maintenance Mode as a fault-finder, somewhat akin to the way programs were broadcast 
in Program Mode, except that instead of outputting Operand A unchanged each VAMP 
outputs any Operand which contains a fault notification. The VAMP-numbers corresponding 
to fault-finders will therefore emerge from the net where the motor torques emerge in Data 
Mode, and the faulty VAMPs can be replaced. 

If the two results match, then the successor is assumed to be not faulty. The VAMP will 
bring Test-out to True at the start of the next cycle, causing its successor to repeat the 
operation just described. Since the VAMP must now fill its successor’s role, it will continue 
to take as its operands the inputs Operand-A' and Operand-B', but in addition will now 
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dump its result on the input/output Result'. The output Result remains tristated. 

The VAMP will return to normal operation after its input Test-back-in (connected to 
its successor’s output Test-back-out) is brought True for one cycle. The input/output 
Result' is tri stated, and its output Test-back-out is brought True for one cycle (causing 
its predecessor to resume normal operation in the same sequence). On the next cycle the 
VAMP will again load as operands the inputs Operand-A and Operand-B, and dump its 
result on the output Result. 

So far we have described the action of VAMPs in the middle of the order. It remains 
to describe the ends. The last VAMP can simply have its output Test-out fed back into its 
input Test-back-in. This causes the sequence to reverse when the last VAMP is reached. 
To check the first VAMP we must introduce an extraneous VAMP (and properly speaking, a 
second extraneous VAMP to check that one). The first extraneous VAMP can simply have 
its output Test-back-out fed back into its input Test-in. 
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Table VII.1 — Primitive Module Operation Sequencing 

(refer to Figure VII.1) 

(This primitive module calculates the 3— scalar component of the result vector) 

Operation_Flops_Pest._ Source _ 


VA: 1 

(result in adder/subtractor) 

BUS-1 

BUS-2 

ADDER 

<= 

<= 

4= 

REG: OP-A-3 

REG: OP-B-3 

BUS-1, BUS-2 

SV: 1 

BUS-1 

<= 

REG: OP-A-1 


BUS-2 

4= 

REG: OP-B-3 


MULT-1 

4= 

BUS-1, BUS-2 

(scalar multiplier in 1— scalar component of Operand A, by convention) 

(result in multiplier-1) 




VD, MV, MM: 1 

BUS-1 


REG: OP-A-3 


BUS-2 

<= 

REG: OP-B-3 


MULT-1 

4= 

BUS-1, BUS-2 


MULT-2 

4= 

REGS: OP-A-2, OP-B-2 

2 

BUS-1 

4= 

MULT-1 


BUS-2 

4= 

MULT-2 


ADDER 

4= 

BUS-1, BUS-2 


MULT-1 

4- 

□erne- Qp _a_ ■{ np d -i 

3 

BUS-1 

4= 

MULT-1 


BUS-2 

4= 

ADDER 


ADDER 

4= 

BUS-1, BUS-2 


(alternatively, adder can accumulate instead of being totally bus-driven) 
(result in multiplier-1) 


VC: 1 BUS-1 <= REG: OP-B-2 

BUS-2 <= REG: OP-B-1 

MULT-1 4= BUS-1, REG: OP-A-1 

MULT-2 <= BUS-2, REG: OP-A-2 

2 BUS-1 <= MULT-1 

BUS-2 <= MULT-2 

SUBTRACTOR 4= BUS-1, BUS-2 

(subtractor computes BUS-1 — BUS-2) 

(result in adder/subtractor) 
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(To enable VC, the two multipliers are placed between the two register pairs NOT 
corresponding to the result component calculated by the primitive module. Thus this example 
calculates the 3^ scalar component of the result.) 


Figure VII.1 — Primitive Module Block Diagram 
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MATRIX-VECTOR OPERATOR DATAPATH 



Figure VII.2 — Vector Arithmetic Modular Processor (VAMP) 
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CONTROL STRUCTURE 


IN OPERATION, CONTROL MUST GOVERN: 

- 1) When operand R is read in, & to which re 9 s; 

2) Uhen operand B is read in, & to which regs; 

I— 3) Uhen the Result is output, & from which regs; 

r 4) Uhat operation is performed. 


Figure VII.3 — VAMP Control Registers 


83 























| Operand Sources 

Figure VII.5a — Non-Error-Correcting Configuration 


Result Destinations 



Result 

Result' 

lnhibit-ln 

Inhibit-Out 

Test-In VAMP Test-Out 

Test-Back-Out 

Test-Back-In 

Op-A' Op-B' 

Op-A Op-B 


Result 

Result' 

lnhibit-ln 

Inhibit-Out 

Test-In VAMP Test-Out 

Test-Back-Out 

Test-Back-In 

Op-A' Op-B' 

Op-A Op-B 



Figure VII.5b — Limited Automatic Error-Correction 









5. SUGGESTIONS FOR FUTURE EXTENSIONS 


■jz: r: zzzzzzzz- -r. -. -- 

:f;=.r^==ir“^=== 

=r:=r:=:r^^—•== 

:rr::r.:r,r—.. *». 

™ ......... - ~»~ “•"■"’™■ — 

.. . rnmnute the Inverse Dynamics in approximately real time using 

neural structure. We close with a few remarks concerning the possibility of generalizing the 

Ofloe(n')) embedding to other recursive structures. 

2 question o, how to incotpota.e dynamical considerations into on-line ,ra,ec.ory 

p ,anl::an area o, open research W HoiierbachjfSj^ — 

velocity o. a trajectory so as to remain within torque I,nuts. *« **» ° * 

:r——tsr~ 

case of time-optima, control along a specified trajectory, but do not How the pa 
varied in space and require moderately intensive computation by the os 

Consider Figure Vlll.t, which might be taken to represent a shit, regtster of depth - 
Data records pushed in at the top progresses through the shift register with times ep 

emerge out the bottom time m\ later. 

imagine that the shift register was interfaced to a memory board m such a way 
..,00k like., memory ,0 the host, so that any ot the locations could be read or wntten as 
Tory A, each times,ep x the 0 * row would be shitted ou, the bottom, the whoie array 
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shifted down one, and the host would write a new set of values at the top. At any time, of 
course, the host could read or write any of the array locations. 

Now imagine that some of the array locations in the j— row correspond to “input 
values” ({q(j\), g(yx), g(iX)}) and some locations correspond to “motor torques” ({r(j'X)}); 
and that the “torque” locations are really hard wired (through such a dynamics box as we 
have described) to the "input value” locations. The box continuously computes, for each 
set of input values in the shift register, the corresponding torques — if the host changes 
an input value anywhere in the array, the torques corresponding to the new set of values 
automagically appear. Set X equal to the refresh rate at which new motor torques are 
supplied to the manipulator. Now as each value is shifted out the bottom, imagine that 
a demon catches it and passes the torques to the manipulator motors — simultaneously 
another demon fills in a new set of input values at the top and the torques appear. We might 
as well have the inertial Cartesian coordinates and velocities of each link endpoint appear 
too; since they are manifestly simpler to calculate from the same input data as the inverse 
dynamics, the fractional cost to include them is small. 

This now becomes a fairly explicit representation of many interesting characteristics of 
the manipulator trajectory for the next m time periods. If the host inspects the values but 
makes no changes, those m values will be shifted to the manipulator, one by one; and that 
will define the path the manipulator will follow for the next m periods. Alternatively the 
host may change one or more values causing the corresponding changes in the torques; 
the manipulator would then follow the revised course. This arrangement might be useful 
in helping to solve the problem of how to incorporate dynamical considerations into on-line 
trajectory planning, acting to help optimize a crude trajectory generated by a higher-level 
planner. 

One extension to this basic idea would be to use our box to also calculate the input 
value joint velocities and positions ({q{t), q{t)}) directly from the acceleration profile ({?(<)}), 
rather than having them set directly by the host. This would avoid the embarassing possibility 
of (e.g.) a trajectory requiring an instantaneous step discontinuity in manipulator position, 
as well as reduce computational demands on the host. Another extension would be to have 
several such shift registers; one could thereby sweep out an envelope around the proposed 
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trajectory, exploring in parallel possible futures and interpolating between them. 

Another interesting area is the overlap with human psychology and neurophysiology. 
Torre and Poggio[4l],[22] have shown the theoretical possibility that a neuron could perform 
an analog multiplication in its dendritic branches within about a millisecond (this capability 
was originally postulated to be necessary in order to explain certain aspects of visual 
processing). The analysis treats the dendritic branches according to membrane theory in 
passive RC cables. Where g , and g 2 are inputs they are able to produce a term proportional 
to {gi — a?ig 2 )- By additionally connecting the g\ input appropriately to a side branch on 
the dendritic pathway to the axon it is possible to show theoretical cancellation of the linear 
term. Analog additions may be performed as in classical circuit theory (given appropriate 
arrangements of the dendrites). 

Thus, given the time bounds on the formulations developed above, the nervous system 
might be capable of performing the inverse dynamics calculation in something approximating 
real time. Since the computations are performed in analog by biological components one 
necessarily expects them to be “dirty”, i.e. contaminated with noise, inaccuracies, and 
other errors. If one postulates a large number of such inaccurate devices performing the 
same calculation and averaging the result, however, from fundamental statistical properties 
one may show that the resulting calculation can be made arbitrarily accurate by taking the 
number ot devices arbitrarily large. Also, though the formalisms were developed for a single 
chain of length n, the fact that the time complexity increases only as O(log(n)) suggests the 
possibility that one might control other large systems involving many degrees of freedom 
without paying an exorbitant penalty in real elapsed time. Hollerbach and Flash[17] have 
performed experiments investigating the possibility that human subjects perform some sort of 
dynamic scaling in planning planar arm trajectories. Many associated questions immediately 
arise, of course, such as how the nervous system learns to perform the computation; or, 
if it is hard wired, how the system learns the parameters; and so forth. We do not wish to 
engage in this debate, but only to point out that the computational aspects are tractable. 

Finally, it seems likely that the process of embedding a serial linear-time recursive 
algorithm in a parallel logarithmic-time algorithm is generalizable, certainly at least within the 
context of associative ring operators. We saw that several basic properties were exploited 
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in our analysis above, including the associativity of the ring operators and the capability to 
order the recursive variables in time according to data dependencies. 

The basic strategy followed in Section IV was to expand the closed-form non-recursive 
formula for X a ,b into an equivalent expression involving two similar formulae for X a , k and 
Xfc+i,!,. This was taken to be the combining form for X a<b . In making the expansion, it was 
also necessary to expand and re group expressions for the dependent variables in terms of 
their combining forms. We speculate that the associativity of the operators permitted the 
necessary expansion of the dependent variables. Linearity is not necessary, as it is possible 
to devise a combining form for the (non-linear) (a+b) from expressions for (^) and (^): 

M LJMiL 

= 0(0 

Thus it might be possible to devise logarithmic-time parallel algorithms from linear-time serial 
ones even if scalar division is involved, and perhaps other non-linear (non-ring) operators. 
Of particular interest would be a general mechanism for logarithmic embedding of linear 
algorithms involving matrix inversion (or solutions of simultaneous linear equations, e.g. 
Gaussian elimination) at each step. This would render a much wider class of algorithms 
accessible to logarithmic-time techniques, e.g. perhaps the inverse kinematics or the direct 
(integral) dynamics. 

In Appendix A it is noted that the ability to order the recursive variables according to 
data dependencies 

X t y Y t y Z, > ... 

where we define 

X, y Y, iff Yi does not depend on Xi for any j 
iff dYi/dX) = 0 

guaranteed minimal satisfiability of the relative offset inequalities. We speculate that this 
condition might also be necessary to form logarithmic-time combining forms as well. This 
may arise since in devising the combining forms it was necessary to apply previously-deduced 
combining forms to break apart the constituent dependent variables. Lacking this condition, 
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one may at least imagine a com In which deriving Hie combining form for X a , b requires 
expanding the combining farm for T B >, while d er iv in g the combining form for requires 
doing the same for X a , b - The abiMy to order the recursive worWalealy data dependencies 
insures that at least this particuiar deadtock cannot arise. A more teroal demonstration of 
the applicability of this condition wouklbe useful. 











9. CONCLUSIONS 


We have shown that considerable time savings accrues from embedding the inverse dynamics 
calculation in a parallel computation. A parallel-time algorithm with time complexity only 
logarithmic in the number of joints has been derived. Hardware necessary to implement 
such parallel algorithms has been considered, and the requirements shown to be substantial 
but not excessive. Using the concepts developed, it should be possible to devise a 
device capable of implementing the calculation at a speed primarily bounded by the 
input/output requirements which the algorithm imposes on the host. We have sketched 
speculative extensions to this work in the areas of on-line trajectory planning, psychology 
and neurophysiology, and parallel algorithm theory. 
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Appendix A — Derivation of the Linear Time Offsets and C 

In the following, as in Table 111.1, “Avat7(X,_i) = t" means that variable X t _! is made 
available to the t— node at time t (on the backward recursion; substitute X 1+1 on the 
forward recursion). The abbreviations VA (vector addition), VC (vector cross product), etc., 
are explained in Table HI.1; they denote the time required to perform certain matrix or vector 
operations. 

First we determine the relative delays, or offsets, in variable availability times. We do 
this by requiring the following implication, for each propagated recursive variable: 

Avail(X,) = max(Ai)ai7(X,_i) -f- a, Avail[Yi—i ) 0 , Avail{Z,—\) -f- 6, ...) + 7 

=> Avail(X,) — Avail(X,—j ) -f- a 7 

This condition amounts to saying that nothing delays the availability of a variable longer than 
itself. This allows one to infer 

max(Avai7(X,_i) + a,Avail(Y{—i) + P,Avail(Zi—i) + 6 ,...) — Avail(X t —i) + a 

from which immediately follow the inequalities 

Ava%l(Xi—i) -f- a > Avail(Y,—i) -f- 0 
Avail(Xi^i) + a > Avail(Z,—i) + 6 


That the set of all such inferrable inequalities is globally satisfiable follows from the ability 
to globally order the recursion variables 

XibYi^Zi* ... 


where we define 

X, X Yi iff Y, does not depend on X 3 for any j 
iff dY l /BX, = 0. 

In the linear Newton-Euler recursion, for example, we have 


Hi > Ji > Pi > Wi > Ui 

and the non-propagated variables could be included in the chain if desired. 
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By considering starting conditions (initiation of the calculation) one can generate another 
set of inequalities of the form 


Avail{X x ) + a' > Avail{Y x ) + /?' 
Avail{X i) + a' > Avail(Zi) -f- 6' 


Since both sets of inequalities are satisfiable it is possible to reach a point of minimum 
satisfaction (how dreary... why would one wish to do so?). This is the unique point (Atmt7(Xi), 
Avails), Avail{Z\ ),...) which satisfies both sets of inequalities above, and also minimizes 
Avail{X\ ) for each variable X. This defines the relative offsets which will minimize the total 
computational time. 

On the backward Newton-Euler recursion only u t , w,-, and p, need be considered. This 
is because r,, F lt and N, are merely passed directly to forward recursion nodes. From Table 
111 . 1 , 

Avail(u>i) = max(Avai7(wi_i) -T VC -f- VA, Avail(u t — i)) -T MV + VA 
Thus for minimum delay {MV + VA) in propagating w, we require that 

max(Auai7(w,_i) + VC + V A,Avail{bJ,—\)) = Avat7(w,_i), 


hence 

Availfai— i) > Avail(uJi—i ) + VC + VA (*) 


Similarly, 

Avail(p,) = max(Auai7(aj,) + 2 VC -\- VA, 

Avail{u)i) -T VC + VA, 

Avail{p ,-_!) + MV) + VA 
= max(At-o l 7(w t _ 1 ) + MV + 2VC + 2 VA, 

(max(At;ai7(w,_i), Avail^t) + VC + VA) + MV + VA) + VC + VA, 
Availip^A- MV) + VA 
= max(Auai7(w t _x) + MV + 2 VC + 3VA, 

Avail(u>i—i) + MV + VC + 2VA, 

Auat7(p t _i) + MV) + VA 
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And so the minimal delay in propagating p, requires both 

Avail(p ,•_!) > Auai/(w ! _ 1 ) + 2 VC + 3 VA 
AvoiZ(p,_ 1 ) > Avail(uj,.-\) -f VC + 2VA. 

If equality holds in (*) then these two bounds are equivalent. 

On the forward recursion only /, and n, need be considered. 

Avail{n l ) — ma.x(Avail(f i+i) + VC -f- VA, Avail(rii^. i)) -f MV -f VA 


and so 

Auai/(n,4.1) > Avail(f,+i)-\-VC -f VA. (***) 

Next we determine the constant C by showing when 7), the last generalized joint force 
of the forward recursion, becomes available as output. From Table 111.1, assuming all input 
values become available simultaneously at time 7 = 0, 

Avail(uj) = MV + VA 
Avail(Cj\) = MV + VC + 2VA 

Avai7(pj) = max(Auar7(w]) + VC + VA,Avail(u>i) -j- 2 VC + VA,MV ) -j- VA) 

= + 2 VC + 4 VA 

Note that these satisfy (*) and (**) so the propagation time is (MV + VA) per node. 

Avail(w n ) = (n — 1)(MF + V A) + Avail(ui) 

= (n — 1 )(MV -f VA) + MV + VA 

Avail(u n ) = (n - 1 )(MV + VA) + MV + VC + 2 VA 

Avail(p n ) = (n— 1J(A*F + VA) + MV + 2VG’ + 4VA 

Avail(f n ) — ma.x(Avail(p n ), Avail(u> n ) -j- VC + VA, Avail(u> n ) -j- 2VC + VA) + VA 
= (n — l)(MV + VA) + MV + 2 VC + 5VA 
Avail(F n ) = Avail(r n ) + SV 

= (n - 1)(MV + VA) + SV + MV + 2 VC + 5VA 
Avail(N n ) — ma.x(Avail(u n ) + VC,Avail(u n )) + MV -f VA 
= (n - 1 )(MV + VA) + 2 MV + VC + 3 VA. 

Thereafter, on the forward recursion, (recognizing that Avail(f n+1 ) = Avail(n n+ i) — 0), 

Avail(f n ) — Avail(F n ) -f- VA 

= (n - l)(Af V + VA) + SV + MV + 2 VC + 6VA. 

Avat7(n„) = max(Avail(F n ) -j- VC, Avail(N n )) + 3 VA 
= (n — l)(My + VA) + MV + VC + 6 VA 
+ max(SV + 2 VC + 2VA, MV) 

= (n - 1)(MV + VA) + SV + MV + 3 VC + 8 VA 
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These two expressions satisfy (***) so propagation occurs at the maximum rate of (MV +VA) 
per node. 

Availim) = 2(n - 1 )(MV + VA) + SV + MV + 3 VC + 8 VA 
Avail(r\) < m&x(Avail(f ]), Avail(n\)) 

= 2 (n - 1 )(MV + VA) + SV + MV + 3 VC + 8 VA 

(Actually, Avail^i) will depend on whether joint i is translational (= Avail(fi)) or rotational 
(= Availing), but we assume here rotational, the worst case.) 

Assuming a maximally parallel implementation, we would have: 

VA= 1 Addn 
SV= 1 Mult 
VC= 1 Mult + 1 
MV= 1 Mul( + 2 

So 

Avail(T\) < 

suffice. 


(using 3 adders) 

(using 3 multipliers) 

Addn (using 6 multipliers and 3 adders) 
Addns (using 9 multipliers and 3 adders). 

(2 n -j- 3) Mults + (6n + 7) Addns 
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Appendix B — Derivation of the Logarithmic Recursive Formulae 


NEWTON-EULER BACKWARD RECURSION VARIABLES: 

The derivation of the logarithmic combining form for w, has been developed in the text. 
Next, we show' that u, satisfies the following closed-form formula: 

l 

= Y1 w L a j( z J-^j + ^-1 x z 3 -'<h) 

J—o 

as it is a fixed-point of the recursive formula for d>, in Table 1.1: 

(,-i 

w« = AM S W J,,-i a j( Z 3-^3 + X *7-1 fc) 

X J=0 

4- + w,_i X 

= Aj + &i(zi —] ij 7 + w,_i X Zi-iqi)j 

As in the case of u,, we take w 0 = i) 9 0 - Most applications of interest will have 

wo = <j 0 = 

in order to matcn at a — 0 and 6 = t, 

b 

Ua,b = X) W l,b a A Z 3-'h + "a,(j-l) X *>-!$>) 

k 

— X* cr j( 2 j—l’9j + W a ,(;_l) X i9j) 

>=n 

i> 

>=fc+l 

6 

— f^ffc-fl),bWa,fc + -f ((^?fc + X),(j —l) W °- fc ) X *> —1$>) 

J=K-f 1 

= W^ + 1 , ti « a> fc + (W(fc + I) l6 w a ,*) X W(*, + 1),6 + W (fc + 1)>6 

which is the combining form for u> a , b . 


101 




To derive the combining form for p u6 , it is necessary to create the auxiliary variables 


Qa,h = J2 W J,b*) Z ]~ 

J—« 

= Wl +hh Q a , k + Q(fc-fi),fc 

b 

Ra.b = ^j+1 ,bPj 

= W 7 k + 1 ' h R tl ,k + R(k + 1), b 

b 

Sa.b = ^>2 Wj_|- | X P_, ) 

3=n 

= w 7 k+ub s a .k+ J2 ^ + .,fe+ 1J u »^ u ‘+i J )x P ;) 

>=*+> v y 
= (W^-|_i,j, w rc.fc) X /?(/ r _|_i) 1 (, S{ k + l),b 

Next we show that p l satisfies the closed-form expression 

Pi = Wl,p 0 + Yi Wj +l Juj X p* + w> X (w> X P*) 

>=i ' 

+ 2j_!+ 2 Wj X AjJ 

ThiQ i<5 3 fiYAH nnint nf tho r0C'J r Sl w 0 forrTVjIS for p t SS ShOWP 

Pi = Aj(wl,^p a + £ X p* + X (w, X P*) 

+ + 2wj X 

+ Wf +1 Xp' + w.X (w, X p*) 

+ d t (Ajz t - 1?; + 2w t X Afz t _i<7i)) 

= Afp,;_! -f UJ, X P* + Wi X (w { X p‘) 

+ i'q t -f~ X AjZi—i<]i) 

where p 0 is the acceleration of the base. Typically this is the acceleration due to gravity at 
the site. If one took w 0 ^ 0 above then p 0 may also include a term for w 0 x (w 0 x Pq), where 
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p* is a vector from the Earth’s center 0(_i) to the site; this accounts for the centripetal 
acceleration arising from the rotation of the Earth. If one accounts for the gravitational 
acceleration ( g) by taking pi — g at the base, else pf = 0 for i ^ 0 , then the formula may be 
equivalently re-written with greater clarity (covering both cases wo = 0 and wo 7 ^ 0 ) as 


Pi 


+ u, X p* + W 7 X K X p*) 

j=0 ^ 

+ d A A J z i -+ 2u 3 x ?>)) 


This we will take to be the defining closed-form non-recursive formula for p ; . The term 
involving p? is a technical artifice to account for p 0 cleanly, and will vanish in the combining 
form. 


The demonstration of the combining form of p 0ifc will require the vector identity o x (6 X 
c) + (ixa)xc = lix(axc). In order to match at a = 0 and b — i, take 
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Pa,b 


= £ Wj+i./pj + w a ,j X p* + w«,j X (w a ,j X p’) + CjiAjzj-i'qj + 2w 0i> X Aj 

i = a ' 

* / 

= Wfc +llt X] ^J+i.fcl p® + "«o x p* + w “.j x K.j x pj) 

J —a ' 

+ i9j + 2w a>; - X Aj *>—i9>)^ 

+ £ ^J+l./pj + fw'fc + j jWa.fc + (w£ +1)> w a ,*) X W(A-fl),j + W(fc+1),>) X Pj 

i=fc+1 V V 7 

+ (W / T + l,j W “.* + W(/r + l),>) X ((VV[ +1>jWo ,A + W (*-+1).>) X P>) 

+ — + 2 (Wl +i jUa,k + <*>k + l,j) X Aj — 

b 

= ^k + l,bPa,k + P(A- + 1),6 + 0^k + l,b^a,k) X £ ^J+l ,bP> 

>=* +1 

+ (W^fc-j-l.bWa.fc) X f(VV^_|_ li(j W ai A) X £ Wj-fl,bPj 
V J = fc+1 

+ 2 £ f^J+i,6( w (fc+i).; X Pj) + a J Wj + i jh (Ajzj_i9 J )'| J 

,=A4-l'- '/ 


Pa,A = (Wjfc+l,bPa,/c) + P(* + l),6 + (Wfc+l.b^a.fc) X i?(fc + l),b 

^ f \ 

+ (VV^ +1 ,(,Wa,fc) x ^(Wfc_,_ li6 w Q ,A) X J2(fe+i),b + 2(5(* +1 ),t + Qk+i,b)J 


NEWTON-EULER FORWARD RECURSION VARIABLES 

As noted in the text, on the Newton-Euler forward recursion the coordinate matrix 
products of interest will be W„+i, fc+ i instead of W£+-i,b- 

Noting that the numbering runs backward, we see that /; satisfies the non-recursive 
formula 

n 

u = £ 

j = i 

n 

= Ai+i £ Wi+2,jFj-¥Fi 

>==<+! 

= A,_).i/i+i + Fi 


104 



To match at a = i and b — n, 


b 

fa,b = ^ 

j=a 

k b 

= Y, W «+id F J + w a+i,k+i E 

j=a i = fc + l 

= /a.fc 4 W a + 1 ,* + ,/<* + !),<> 

If desired, forces and torques applied by the environment to the manipulator tip may 
be incorporated in a fashion similar to incorporating the acceleration of the base in the 
discussion of p aJl . This will not change the displayed combining forms. 

Similarly, n t satisfies the non-recursive formula 

n, = E Wi-fi,jZ-ZV, 4 4 X Fj + Pj X 

3—i ^ 

— AT, + a, X F, 4 P, X (Ai-i-i/i+i) 

4A+, E w * +z,j 4 4 X Fj 4 Pj X (Aj+i/j+i) j 

>=<+i ^ ' 

= N, 4 4 X F,4 p* X (4+,/ i-f i) 4 A,'+i.n«'+i 

To match at a — i and 6 = n we must have 

n a,b = E + 4 X ^ ”1“ P J X + 

J=a 

= E Wa-f4 4 X Fj 4 P* X (Aj+if/j+i,* 4 Wj'4-2,fc+i/*4-i.i>))^ 

j=a ' 

b 

4 VVo+i.it-f i E Wfc+ 2 ,j'(Afj + 4 x + P; X (Aj+i/(j_|_i),i ) )J 

j==fc+i 

== ri a) fc 4 Wa-4~l,/c-f-l ,6 4 ^ ^ Wa-M»i(Pj X (^0“H1,A:1 ^/c-f-1,6) 1 

j=a ' / 

— *la,k 4 Wo-fl,fc + l((A^ +1 Ha,* : ) X /*+l,6 + «k + l,!>) 
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Appendix C — Derivation of the Logarithmic Time Offsets and C 

These may be derived from Table V.1 by inspection, by applying the rule that to insure 
minimum delay, the maximum delay of any variable must be caused by a data dependency 
on that variable. 

Avail(u> x „)> Avail{W Xt y) 

Avail(u> It y) > Avail(W x<y ) -\-VC- 1~ VA 
Avail{Cj x ^ y ) > Avail{(jj Xt y) -f- VC + VA 
Avail(R Xi y ) > Avail(W x ,y ) 

Avail(S x<y ) > Avail(W x , y ) + VC + VA 
Avail(S x ,y) ^ Avail(u Xt y) -f- VC VA 
Avail{S x ,y) + MV > Avail(R x ,y) + VC + VA 
Avail(Q x<y ) > Avail(W Xt y) 

Avail(p x y ) > Avail(W x<y ) -f- 2VC 3 VA 

Avail(j> xy ) > Avail{uj x .y) -f- 2VC + 3VA 
Avail(p x y ) > Avail(u> Xt y) + VC + 2VA 
Avail(p x y) + MV > Avail(R Xi y) + 2 VC + 3 VA 
Avail(p x J 4 -MV > Avail{S x ,y) + VC + 4 VA -f SV 
Avail(p x y ) + MV > Avail(Q Xt y) -f- V'C'-f 4VA + 5F 
+ MV > Avail(W x , y ) + 2 VC + 5VA + SV 

Avail{n X ' V ) > Avail(f x%u ) + VC + VA 

The delay conditions established, actual timing can be generated. From Table IV.1 we 
extract the a — b case; Table V.1 covers a ^ b. 


Auatf(W 0 , a ) = 0 


Avail(u at a) = MV 


Avail(uj a<a ) — MV 

(*) 

Avail(Q ata ) — MV 


Avail(R at a) — 0 


Avail(S a , a ) = MV + VC. 

n 


However, the equations marked (*) fail to satisfy the delay conditions and so must be revised 
to 

Avail(u ata ) = MV + VC+ V A 
Avail{S a ,a) = MV + VC + VA. 


(*) 

(from (*) above) 
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Analysis of Avail(p a J is less obvious, but proceeds as follows (assuming VC > SV and 
VC > VA) 


Avail(p 9 a ) = 0 
Avoi7(ff a A^„_i9 a ) = MV 
Avail(p 9 a + c a Alz a -iq a ) = MV 4- VA 
Avail(oj a , a X (w 3 , a X pi)) = MV + 2 VC 
Avail(p 9 a + &aAaZ a — iq a + w Q , a X (w a , a X p a )) = MV -f 2VC + VA 

Avail(dj a , a X p“ a ) — MV -f VC 
Av<nl(2u a . a X Q a ,„) — MV -f- SV + VC 
Avail(u llia X p a + 2w a , a X Q a ,a) = MV + SV -f VC + VA 
Avail(p a , a ) = MV -f 2VC + 2VA 
Avatl(p a>a ) = MV + 2VC + 3VA 


(*) 


where the last line is added so that p a<a satisfies the delay conditions (assuming MV > 
SV + 2VA). 

Since these satisfy the minimum delay conditions, propagation occurs at a rate of 
[MV + VA) per node. It can readily be seen that, in general, 


Avail(X a , b ) = Avail[X a , a ) + [log 2 (6- a + 1 )]{MV + VA). 


Thus in particular, if X, is the linear recursive variable corresponding to X a ,b, then X { — X 0ti 


so 


Avail[X,) — Avail(Xo,i) 

= Avoil(X a ,a) + [log 2 (i + 1)1 (MV + VA). 


Hence, 


Avail(<jj 0 'i ) = Avail(u aia ) + [log 2 (i + 1)](MV + VA) 

Avail(u 0il ) — Avail(w a .a) + pog 2 (t + 1)1(MV + VA) 

Avail(p 0tl ) = AvaH[p a , a ) + P°g2( l + 1)1(-WV + VA) 

Avail[r 0<i ) = ma.x(Avail(p o i ), Avail(uo,i) + VC -f- VA, Avai7(u> 0 ,i) + 2 VC A VA) -j- VA 
Ava%l(F 0> i) = Availfto't) -f SV 

= MV + SV + 2VC + 4VA + [log 2 (» + 1)1(MV + VA) 

Avail(No'i) = max(Avat7(w 0 , ; ) + VC, Avail(d>o,i)) + MV + VA 
= 2MV + VC + 2VA + flog 2 (t + 1)1(MV -f VA) 
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Thereafter, on the forward recursion (from Table IV.1), 


Avail{f ntn ) — Avail{F 0 , n ) 

= MV+ SV + 2 VC + 4 VA + flog 2 (n + 1)1(MV + VA) 

Avail{n n<n ) — ma.x(Avail(Fo !n ) + VC, Avail(N 0 , n )) + VA 
= MV + VC + 3 VA 

+ ma x(MV, SV + 2V<7 + 2VA) + flog 2 (n + 1)1(MV + VA) 

which satisfies the delay conditions. Propagation therefore occurs at the maximum rate and 

Avail(n n>0 ) = MV + VC + 3 VA 

+ max(MV, SV + 2 VC + 2 VA) + 2[log 2 (n + 1)]{MV + VA) 
Avail(r 0 ) = MV + VC + 3 VA 

+ ma x{MV, SV + 2 VC + 2 VA) + 2|Tog 2 (n + 1)1(MV + VA) 

= 2[log 2 (n + 1)](MV + VA) + MV + SV + 3 VC + 5VA 
= 2[log 2 (n + 1)](MV + VA) + 5 Mults + 10 Addns 
> Avail(ri) 

assuming again a maximally parallel system. 
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Appendix D — Unification of Logarithmic a = b and a ^ b Cases 

By the following technical artifice we can make the a — b case look like a ^ b. 

W a , k = I 
Wfc+i,6 = A a 

u a,k — &a z a —l9a 
Vk+l,b = 0 

&aZa —1 *7a 

Wfc + 1,6 0 

Qa,k — &a. z a — l9a 

Qk + l,b = 0 

Ra,k — 0 
Rk + l,b = P a 
S a ,k = 0 
Sk + l,b = 0 

Pa,k ~ &a°a — l9 a 
Pk + l,b = Pa 

Q a , a is substituted for Q k +i,b in p a , a 

Wa-fl,fc+l = I 

Wfc+2,6+1 = Aa+l 

/a, k ~Fa 
fk-i-1 .h — 0 
*la,k — N a 
nfc+ 1,6 = S* a X -Fa- 

Now following two applications of the (n/2) processor nodes to the n groups of input 
data we have X a , a as required, and similarly on the forward recursion. 
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